

Russell Davidson | James G. MacKinnon


Chapter 1 


Regression Models 


1.1 Introduction 


Regression models form the core of the discipline of econometrics. Although 
econometricians routinely estimate a wide variety of statistical models, using 
many different types of data, the vast majority of these are either regression 
models or close relatives of them. In this chapter, we introduce the concept of 
a regression model, discuss several varieties of them, and introduce the estima- 
tion method that is most commonly used with regression models, namely, least 
squares. This estimation method is derived by using the method of moments, 
which is a very general principle of estimation that has many applications in 
econometrics. 


The most elementary type of regression model is the simple linear regression 
model, which can be expressed by the following equation: 


$$y_t = \beta_1 + \beta_2 X_t + u_t. \tag{1.01}$$


The subscript t is used to index the observations of a sample. The total num- 
ber of observations, also called the sample size, will be denoted by n. Thus, 
for a sample of size n, the subscript t runs from 1 to n. Each observation 
comprises an observation on a dependent variable, written as $y_t$ for
observation t, and an observation on a single explanatory variable, or
independent variable, written as $X_t$.


The relation (1.01) links the observations on the dependent and the
explanatory variables for each observation in terms of two unknown parameters,
$\beta_1$ and $\beta_2$, and an unobserved error term, $u_t$. Thus, of the five
quantities that appear in (1.01), two, $y_t$ and $X_t$, are observed, and three,
$\beta_1$, $\beta_2$, and $u_t$, are not. Three of them, $y_t$, $X_t$, and $u_t$,
are specific to observation t, while the other two, the parameters, are common
to all n observations.


Here is a simple example of how a regression model like (1.01) could arise in 
economics. Suppose that the index t is a time index, as the notation suggests. 
Each value of t could represent a year, for instance. Then $y_t$ could be
household consumption as measured in year t, and $X_t$ could be measured disposable
income of households in the same year. In that case, (1.01) would represent 
what in elementary macroeconomics is called a consumption function. 


Copyright © 1999, Russell Davidson and James G. MacKinnon 3 




If for the moment we ignore the presence of the error terms, $\beta_2$ is the
marginal propensity to consume out of disposable income, and $\beta_1$ is what is
sometimes called autonomous consumption. As is true of a great many econometric
models, the parameters in this example can be seen to have a direct interpretation
in terms of economic theory. The variables, income and consumption, do in- 
deed vary in value from year to year, as the term “variables” suggests. In 
contrast, the parameters reflect aspects of the economy that do not vary, but 
take on the same values each year. 


The purpose of formulating the model (1.01) is to try to explain the observed 
values of the dependent variable in terms of those of the explanatory variable. 
According to (1.01), for each t, the value of $y_t$ is given by a linear function
of $X_t$, plus what we have called the error term, $u_t$. The linear (strictly
speaking, affine¹) function, which in this case is $\beta_1 + \beta_2 X_t$, is
called the regression function. At this stage we should note that, as long as we
say nothing about the unobserved quantity $u_t$, (1.01) does not tell us anything.
In fact, we can allow the parameters $\beta_1$ and $\beta_2$ to be quite arbitrary,
since, for any given $\beta_1$ and $\beta_2$, (1.01) can always be made to be true
by defining $u_t$ suitably.
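
The point that (1.01) is vacuous without assumptions on the error term can be checked numerically. The sketch below uses made-up data and two arbitrary parameter pairs (all values and function names are illustrative assumptions, not taken from the text): whatever parameters are chosen, defining the error as the leftover makes the equation hold exactly.

```python
# Illustrative data only; the values are made up for this sketch.
y = [2.0, 3.5, 5.1, 6.4]
X = [1.0, 2.0, 3.0, 4.0]

def implied_errors(y, X, beta1, beta2):
    """Return u_t = y_t - beta1 - beta2 * X_t for each observation."""
    return [yt - beta1 - beta2 * xt for yt, xt in zip(y, X)]

def satisfies_model(y, X, beta1, beta2, u, tol=1e-9):
    """Check that y_t = beta1 + beta2 * X_t + u_t holds for every t."""
    return all(abs(yt - (beta1 + beta2 * xt + ut)) < tol
               for yt, xt, ut in zip(y, X, u))

# Two very different parameter pairs both "fit" once u_t is defined to match.
for b1, b2 in [(0.5, 1.5), (-100.0, 42.0)]:
    u = implied_errors(y, X, b1, b2)
    print(b1, b2, satisfies_model(y, X, b1, b2, u))
```

Only an assumption about the errors, such as a zero mean, can rule out one of these parameter pairs in favor of another.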


If we wish to make sense of the regression model (1.01), then, we must make 
some assumptions about the properties of the error term $u_t$. Precisely what
those assumptions are will vary from case to case. In all cases, though, it is
assumed that $u_t$ is a random variable. Most commonly, it is assumed that,
whatever the value of $X_t$, the expectation of the random variable $u_t$ is zero.
This assumption usually serves to identify the unknown parameters $\beta_1$ and
$\beta_2$, in the sense that, under the assumption, (1.01) can be true only for
specific values of those parameters.


The presence of error terms in regression models means that the explanations 
these models provide are at best partial. This would not be so if the error 
terms could be directly observed as economic variables, for then $u_t$ could be
treated as a further explanatory variable. In that case, (1.01) would be a
relation linking $y_t$ to $X_t$ and $u_t$ in a completely unambiguous fashion. Given
$X_t$ and $u_t$, $y_t$ would be completely explained without error.


Of course, error terms are not observed in the real world. They are included 
in regression models because we are not able to specify all of the real-world 
factors that determine $y_t$. When we set up our models with $u_t$ as a random
variable, what we are really doing is using the mathematical concept of
randomness to model our ignorance of the details of economic mechanisms.
What we are doing when we suppose that the mean of an error term is zero is
supposing that the factors determining $y_t$ that we ignore are just as likely to
make $y_t$ bigger than it would have been if those factors were absent as they
are to make $y_t$ smaller. Thus we are assuming that, on average, the effects
of the neglected determinants tend to cancel out. This does not mean that
those effects are necessarily small. The proportion of the variation in $y_t$ that
is accounted for by the error term will depend on the nature of the data and
the extent of our ignorance. Even if this proportion is large, as it will be in
some cases, regression models like (1.01) can be useful if they allow us to see
how $y_t$ is related to the variables, like $X_t$, that we can actually observe.


1 A function $g(x)$ is said to be affine if it takes the form $g(x) = a + bx$ for
two real numbers a and b.


Much of the literature in econometrics, and therefore much of this book, is 
concerned with how to estimate, and test hypotheses about, the parameters 
of regression models. In the case of (1.01), these parameters are the constant
term, or intercept, $\beta_1$, and the slope coefficient, $\beta_2$. Although we will begin
our discussion of estimation in this chapter, most of it will be postponed until 
later chapters. In this chapter, we are primarily concerned with understanding 
regression models as statistical models, rather than with estimating them or 
testing hypotheses about them. 


In the next section, we review some elementary concepts from probability 
theory, including random variables and their expectations. Many readers will 
already be familiar with these concepts. They will be useful in Section 1.3, 
where we discuss the meaning of regression models and some of the forms 
that such models can take. In Section 1.4, we review some topics from matrix 
algebra and show how multiple regression models can be written using matrix 
notation. Finally, in Section 1.5, we introduce the method of moments and 
show how it leads to ordinary least squares as a way of estimating regression 
models. 


1.2 Distributions, Densities, and Moments 


The variables that appear in an econometric model are treated as what statis- 
ticians call random variables. In order to characterize a random variable, we 
must first specify the set of all the possible values that the random variable 
can take on. The simplest case is a scalar random variable, or scalar r.v. The 
set of possible values for a scalar r.v. may be the real line or a subset of the 
real line, such as the set of nonnegative real numbers. It may also be the set 
of integers or a subset of the set of integers, such as the numbers 1, 2, and 3. 


Since a random variable is a collection of possibilities, random variables cannot 
be observed as such. What we do observe are realizations of random variables, 
a realization being one value out of the set of possible values. For a scalar 
random variable, each realization is therefore a single real value. 


If X is any random variable, probabilities can be assigned to subsets of the 
full set of possibilities of values for X, in some cases to each point in that 
set. Such subsets are called events, and their probabilities are assigned by a 
probability distribution, according to a few general rules. 

Discrete and Continuous Random Variables


The easiest sort of probability distribution to consider arises when X is a 
discrete random variable, which can take on a finite, or perhaps a countably 
infinite number of values, which we may denote as $x_1, x_2, \ldots$. The probability
distribution simply assigns probabilities, that is, numbers between 0 and 1, 
to each of these values, in such a way that the probabilities sum to 1: 


$$\sum_{i=1}^{\infty} p(x_i) = 1,$$


where $p(x_i)$ is the probability assigned to $x_i$. Any assignment of nonnegative
probabilities that sum to one automatically respects all the general rules
alluded to above. 


In the context of econometrics, the most commonly encountered discrete ran- 
dom variables occur in the context of binary data, which can take on the 
values 0 and 1, and in the context of count data, which can take on the values 
0, 1, 2,...; see Chapter 11. 


Another possibility is that X may be a continuous random variable, which, for 
the case of a scalar r.v., can take on any value in some continuous subset of the 
real line, or possibly the whole real line. The dependent variable in a regression 
model is normally a continuous r.v. For a continuous r.v., the probability 
distribution can be represented by a cumulative distribution function, or CDF. 
This function, which is often denoted F(x), is defined on the real line. Its 
value is $\Pr(X \le x)$, the probability of the event that X is equal to or less
than some value x. In general, the notation Pr(A) signifies the probability
assigned to the event A, a subset of the full set of possibilities. Since X is
continuous, it does not really matter whether we define the CDF as $\Pr(X \le x)$
or as $\Pr(X < x)$ here, but it is conventional to use the former definition.


Notice that, in the preceding paragraph, we used X to denote a random 
variable and x to denote a realization of X, that is, a particular value that the 
random variable X may take on. This distinction is important when discussing 
the meaning of a probability distribution, but it will rarely be necessary in 
most of this book. 


Probability Distributions 


We may now make explicit the general rules that must be obeyed by probability
distributions in assigning probabilities to events. There are just three
of these rules:

(i) All probabilities lie between 0 and 1; 

(ii) The null set is assigned probability 0, and the full set of possibilities is 
assigned probability 1; 

(iii) The probability assigned to an event that is the union of two disjoint 
events is the sum of the probabilities assigned to those disjoint events. 

We will not often need to make explicit use of these rules, but we can use
them now in order to derive some properties of any well-defined CDF for a
scalar r.v. First, a CDF $F(x)$ tends to 0 as $x \to -\infty$. This follows because
the event $(X \le x)$ tends to the null set as $x \to -\infty$, and the null set has
probability 0. By similar reasoning, $F(x)$ tends to 1 when $x \to +\infty$, because
then the event $(X \le x)$ tends to the entire real line. Further, $F(x)$ must be
a weakly increasing function of x. This is true because, if $x_1 < x_2$, we have


$$(X \le x_2) = (X \le x_1) \cup (x_1 < X \le x_2), \tag{1.02}$$


where $\cup$ is the symbol for set union. The two subsets on the right-hand side
of (1.02) are clearly disjoint, and so


$$\Pr(X \le x_2) = \Pr(X \le x_1) + \Pr(x_1 < X \le x_2).$$


Since all probabilities are nonnegative, it follows that the probability that
$(X \le x_2)$ must be no smaller than the probability that $(X \le x_1)$.

For a continuous r.v., the CDF assigns probabilities to every interval on the 
real line. However, if we try to assign a probability to a single point, the result 
is always just zero. Suppose that X is a scalar r.v. with CDF F(x). For any 
interval [a,b] of the real line, the fact that F(x) is weakly increasing allows 
us to compute the probability that $X \in [a,b]$. If $a < b$,


$$\Pr(X \le b) = \Pr(X \le a) + \Pr(a < X \le b),$$

whence it follows directly from the definition of a CDF that

$$\Pr(a \le X \le b) = F(b) - F(a), \tag{1.03}$$


since, for a continuous r.v., we make no distinction between $\Pr(a < X \le b)$
and $\Pr(a \le X \le b)$. If we set $b = a$, in the hope of obtaining the probability
that $X = a$, then we get $F(a) - F(a) = 0$.


Probability Density Functions 

For continuous random variables, the concept of a probability density func- 
tion, or PDF, is very closely related to that of a CDF. Whereas a distribution 
function exists for any well-defined random variable, a PDF exists only when 
the random variable is continuous, and when its CDF is differentiable. For a
scalar r.v., the density function, often denoted by f, is just the derivative of
the CDF:


$$f(x) \equiv F'(x).$$

Because $F(-\infty) = 0$ and $F(\infty) = 1$, every PDF must be normalized to
integrate to unity. By the Fundamental Theorem of Calculus,


$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} F'(x)\,dx = F(\infty) - F(-\infty) = 1. \tag{1.04}$$


It is obvious that a PDF is nonnegative, since it is the derivative of a weakly 
increasing function. 

Figure 1.1 The CDF and PDF of the standard normal distribution


Probabilities can be computed in terms of the PDF as well as the CDF. Note 
that, by (1.03) and the Fundamental Theorem of Calculus once more, 


$$\Pr(a \le X \le b) = F(b) - F(a) = \int_a^b f(x)\,dx. \tag{1.05}$$


Since (1.05) must hold for arbitrary a and b, it is clear why f(x) must always be 
nonnegative. However, it is important to remember that f(x) is not bounded 
above by unity, because the value of a PDF at a point x is not a probability. 
Only when a PDF is integrated over some interval, as in (1.05), does it yield 
a probability. 
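
As a concrete check on the last point, consider a density that is constant on a short interval. This example is an illustration added here, not one from the text, and the interval $[0, 1/2]$ is an arbitrary choice:

$$f(x) = \begin{cases} 2 & \text{for } 0 \le x \le 1/2 \\ 0 & \text{otherwise,} \end{cases} \qquad \int_{-\infty}^{\infty} f(x)\,dx = 2 \times \tfrac{1}{2} = 1, \qquad \Pr(0 \le X \le \tfrac{1}{4}) = \int_0^{1/4} 2\,dx = \tfrac{1}{2}.$$

The density equals 2 wherever it is positive, yet every probability computed from it by (1.05) lies between 0 and 1.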

The most common example of a continuous distribution is provided by the 
normal distribution. This is the distribution that generates the famous or 
infamous “bell curve” sometimes thought to influence students’ grade distri- 
butions. The fundamental member of the normal family of distributions is the 
standard normal distribution. It is a continuous scalar distribution, defined
on the entire real line. The PDF of the standard normal distribution is often
denoted $\phi(\cdot)$. Its explicit expression, which we will need later in the book, is


$$\phi(x) = (2\pi)^{-1/2} \exp\big(-\tfrac{1}{2}x^2\big). \tag{1.06}$$

Unlike $\phi(\cdot)$, the CDF, usually denoted $\Phi(\cdot)$, has no elementary closed-form
expression. However, by (1.05) with $a = -\infty$ and $b = x$, we have


$$\Phi(x) = \int_{-\infty}^{x} \phi(y)\,dy.$$


Figure 1.2 The CDF of a binary random variable


The functions $\Phi(\cdot)$ and $\phi(\cdot)$ are graphed in Figure 1.1. Since the PDF is the
derivative of the CDF, it achieves a maximum at x = 0, where the CDF is 
rising most steeply. As the CDF approaches both 0 and 1, and consequently, 
becomes very flat, the PDF approaches 0. 
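
The relation between the standard normal PDF and CDF can be verified numerically. The following is a minimal sketch using only the Python standard library; it evaluates the CDF through the error function `math.erf`, which is a special function rather than an elementary closed form, and compares an interval probability computed from the CDF, as in (1.03), with the same probability obtained by integrating the PDF, as in (1.05). The function names, the interval endpoints, and the midpoint integration rule are choices made for this illustration.

```python
import math

def phi(x):
    """Standard normal PDF, equation (1.06)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF, evaluated via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = -1.0, 2.0                  # arbitrary interval for the illustration
print(Phi(b) - Phi(a))            # probability from the CDF
print(integrate(phi, a, b))       # same probability from the PDF
```

The two printed numbers should agree to many decimal places, and `phi(0.0)` is the maximum value of the density, attained where the CDF is steepest.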


Although it may not be obvious at once, discrete random variables can be 
characterized by a CDF just as well as continuous ones can be. Consider a 
binary r.v. X that can take on only two values, 0 and 1, and let the probability 
that X = 0 be p. It follows that the probability that X = 1 is 1— p. Then the 
CDF of X, according to the definition of $F(x)$ as $\Pr(X \le x)$, is the following
discontinuous, “staircase” function:


$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ p & \text{for } 0 \le x < 1 \\ 1 & \text{for } x \ge 1. \end{cases}$$

This CDF is graphed in Figure 1.2. Obviously, we cannot graph a corre- 
sponding PDF, for it does not exist. For general discrete random variables, 
the discontinuities of the CDF occur at the discrete permitted values of X, and 
the jump at each discontinuity is equal to the probability of the corresponding 
value. Since the sum of the jumps is therefore equal to 1, the limiting value 
of F, to the right of all permitted values, is also 1. 
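
The staircase CDF and the link between its jumps and the underlying probabilities can be written out directly. This is a small sketch; the function names and the value p = 0.3 are assumptions made for the illustration.

```python
def binary_cdf(x, p):
    """CDF of a binary r.v. with Pr(X = 0) = p and Pr(X = 1) = 1 - p."""
    if x < 0:
        return 0.0
    if x < 1:
        return p
    return 1.0

def jump(cdf, point, eps=1e-9):
    """Approximate size of the discontinuity of a CDF at `point`."""
    return cdf(point) - cdf(point - eps)

p = 0.3  # arbitrary illustrative probability that X = 0

def cdf(x):
    return binary_cdf(x, p)

print(jump(cdf, 0))   # jump at 0: the probability that X = 0
print(jump(cdf, 1))   # jump at 1: the probability that X = 1
```

The two jumps are p and 1 - p, so they sum to 1, which is the limiting value of the CDF to the right of all permitted values.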

Using a CDF is a reasonable way to deal with random variables that are
neither completely discrete nor completely continuous. Such hybrid variables 
can be produced by the phenomenon of censoring. A random variable is said 
to be censored if not all of its potential values can actually be observed. For 
instance, in some data sets, a household’s measured income is set equal to 0 if 
it is actually negative. It might be negative if, for instance, the household lost 
more on the stock market than it earned from other sources in a given year. 
Even if the true income variable is continuously distributed over the positive 
and negative real line, the observed, censored, variable will have an atom, or 
bump, at 0, since the single value of 0 now has a nonzero probability attached 
to it, namely, the probability that an individual’s income is nonpositive. As 
with a purely discrete random variable, the CDF will have a discontinuity 
at 0, with a jump equal to the probability of a negative or zero income. 
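
A short simulation shows how censoring creates an atom. The sketch below assumes, purely for illustration, that true income is normally distributed with mean 1 and standard deviation 1; the seed, the sample size, and the function names are likewise arbitrary choices.

```python
import random

def censor_at_zero(values):
    """Replace every negative value by 0, as in the income example."""
    return [max(v, 0.0) for v in values]

def empirical_cdf(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

rng = random.Random(12345)   # fixed seed so the sketch is reproducible
true_income = [rng.gauss(1.0, 1.0) for _ in range(100_000)]  # assumed N(1, 1)
observed = censor_at_zero(true_income)

# Share of true incomes that are nonpositive, and the mass sitting exactly at 0
# in the censored data. By construction the two coincide.
share_nonpositive = empirical_cdf(true_income, 0.0)
atom_at_zero = sum(1 for v in observed if v == 0.0) / len(observed)
print(share_nonpositive, atom_at_zero)
```

The empirical CDF of the censored variable is 0 just to the left of zero and jumps by `atom_at_zero` at zero, which mirrors the discontinuity described above.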


Moments of Random Variables 


A fundamental property of a random variable is its expectation. For a discrete 
r.v. that can take on m possible finite values $x_1, x_2, \ldots, x_m$, the expectation
is simply


$$E(X) \equiv \sum_{i=1}^{m} p(x_i)\,x_i. \tag{1.07}$$


Thus each possible value $x_i$ is multiplied by the probability associated with
it. If m is infinite, the sum above has an infinite number of terms. 


For a continuous r.v., the expectation is defined analogously using the PDF: 


$$E(X) \equiv \int_{-\infty}^{\infty} x f(x)\,dx. \tag{1.08}$$

Not every r.v. has an expectation, however. The integral of a density function
always exists and equals 1. But since X can range from $-\infty$ to $\infty$, the integral
(1.08) may well diverge at either limit of integration, or both, if the density
f does not tend to zero fast enough. Similarly, if m in (1.07) is infinite, the
sum may diverge. The expectation of a random variable is sometimes called
the mean or, to prevent confusion with the usual meaning of the word as the
mean of a sample, the population mean. A common notation for it is $\mu$.


The expectation of a random variable is often referred to as its first moment. 
The so-called higher moments, if they exist, are the expectations of the r.v. 
raised to a power. Thus the second moment of a random variable X is the 
expectation of $X^2$, the third moment is the expectation of $X^3$, and so on. In
general, the $k$th moment of a continuous random variable X is

$$E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx.$$

Observe that the value of any moment depends only on the probability distri- 
bution of the r.v. in question. For this reason, we often speak of the moments 
of the distribution rather than the moments of a specific random variable. If
a distribution possesses a $k$th moment, it also possesses all moments of order
less than k. 


The higher moments just defined are called the uncentered moments of a 
distribution, because, in general, X does not have mean zero. It is often more 
useful to work with the central moments, which are defined as the ordinary 
moments of the difference between the random variable and its expectation. 
Thus the $k$th central moment of the distribution of a continuous r.v. X is


$$\mu_k \equiv E\big(X - E(X)\big)^k = \int_{-\infty}^{\infty} (x - \mu)^k f(x)\,dx,$$


where $\mu \equiv E(X)$. For a discrete X, the $k$th central moment is


$$\mu_k \equiv E\big(X - E(X)\big)^k = \sum_{i=1}^{m} p(x_i)(x_i - \mu)^k.$$


By far the most important central moment is the second. It is called the 
variance of the random variable and is frequently written as Var(X). Another 
common notation for a variance is $\sigma^2$. This notation underlines the important
fact that a variance cannot be negative. The square root of the variance, $\sigma$,
is called the standard deviation of the distribution. Estimates of standard 
deviations are often referred to as standard errors, especially when the random 
variable in question is an estimated parameter. 
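
For a discrete distribution, the expectation (1.07) and the central moments are finite sums that can be computed exactly. The sketch below uses a fair six-sided die as the distribution; the die and the function names are assumptions made for the illustration, and exact rational arithmetic from the standard library avoids rounding error.

```python
from fractions import Fraction

def expectation(values, probs):
    """E(X) as in (1.07): each value weighted by its probability."""
    return sum(p * x for x, p in zip(values, probs))

def uncentered_moment(values, probs, k):
    """E(X^k), the k-th uncentered moment."""
    return sum(p * x ** k for x, p in zip(values, probs))

def central_moment(values, probs, k):
    """E((X - E(X))^k), the k-th central moment."""
    mu = expectation(values, probs)
    return sum(p * (x - mu) ** k for x, p in zip(values, probs))

# A fair six-sided die, chosen purely for illustration.
values = [Fraction(i) for i in range(1, 7)]
probs = [Fraction(1, 6)] * 6

mean = expectation(values, probs)
variance = central_moment(values, probs, 2)
print(mean, variance)
```

The variance can also be recovered as the second uncentered moment minus the squared mean, and the first central moment is always zero.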


Multivariate Distributions 


A vector-valued random variable takes on values that are vectors. It can 
be thought of as several scalar random variables that have a single, joint 
distribution. For simplicity, we will focus on the case of bivariate random 
variables, where the vector is of length 2. A continuous, bivariate r.v. (X1, X2) 
has a distribution function 


$$F(x_1, x_2) = \Pr\big((X_1 \le x_1) \cap (X_2 \le x_2)\big),$$


where $\cap$ is the symbol for set intersection. Thus $F(x_1, x_2)$ is the joint
probability that both $X_1 \le x_1$ and $X_2 \le x_2$. For continuous variables, the PDF, if
it exists, is the joint density function²


$$f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1\,\partial x_2}. \tag{1.09}$$


2 Here we are using what computer scientists would call “overloaded function” 
notation. This means that $F(\cdot)$ and $f(\cdot)$ denote respectively the CDF and the
PDF of whatever their argument(s) happen to be. This practice is harmless 
provided there is no ambiguity. 

This function has exactly the same properties as an ordinary PDF. In
particular, as in (1.04),


$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1.$$


More generally, the probability that $X_1$ and $X_2$ jointly lie in any region is the
integral of $f(x_1, x_2)$ over that region. A case of particular interest is


$$F(x_1, x_2) = \Pr\big((X_1 \le x_1) \cap (X_2 \le x_2)\big) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f(y_1, y_2)\,dy_1\,dy_2, \tag{1.10}$$


which shows how to compute the CDF given the PDF. 


The concept of joint probability distributions leads naturally to the impor- 
tant notion of statistical independence. Let (X1, X2) be a bivariate random 
variable. Then $X_1$ and $X_2$ are said to be statistically independent, or often
just independent, if the joint CDF of $(X_1, X_2)$ is the product of the CDFs of
$X_1$ and $X_2$. In straightforward notation, this means that


$$F(x_1, x_2) = F(x_1, \infty)\,F(\infty, x_2). \tag{1.11}$$


The first factor here is the joint probability that $X_1 \le x_1$ and $X_2 \le \infty$. Since
the second inequality imposes no constraint, this factor is just the probability
that $X_1 \le x_1$. The function $F(x_1, \infty)$, which is called the marginal CDF of
$X_1$, is thus just the CDF of $X_1$ considered by itself. Similarly, the second
factor on the right-hand side of (1.11) is the marginal CDF of $X_2$.


It is also possible to express statistical independence in terms of the marginal 
density of $X_1$ and the marginal density of $X_2$. The marginal density of $X_1$ is,
as one would expect, the derivative of the marginal CDF of $X_1$,


$$f(x_1) \equiv F_1(x_1, \infty),$$


where $F_1(\cdot)$ denotes the partial derivative of $F(\cdot)$ with respect to its first
argument. It can be shown from (1.10) that the marginal density can also be
expressed in terms of the joint density, as follows:


$$f(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2. \tag{1.12}$$


Thus $f(x_1)$ is obtained by integrating $X_2$ out of the joint density. Similarly,
the marginal density of $X_2$ is obtained by integrating $X_1$ out of the joint
density. From (1.09), it can be shown that, if $X_1$ and $X_2$ are independent, so
that (1.11) holds, then


$$f(x_1, x_2) = f(x_1)\,f(x_2). \tag{1.13}$$


Thus, when densities exist, statistical independence means that the joint den- 
sity factorizes as the product of the marginal densities, just as the joint CDF 
factorizes as the product of the marginal CDFs. 
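
The factorization criterion can be tried on two simple joint densities. In the sketch below, both densities are defined on the unit square and are hypothetical examples chosen for this illustration: one is the product of its marginals and the other is not. The marginals are obtained by integrating the other variable out numerically with a midpoint rule, in the spirit of (1.12).

```python
def integrate(f, a, b, n=2_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def marginal_x1(joint, x1):
    """Marginal density of X1: integrate x2 out of the joint density."""
    return integrate(lambda x2: joint(x1, x2), 0.0, 1.0)

def marginal_x2(joint, x2):
    """Marginal density of X2: integrate x1 out of the joint density."""
    return integrate(lambda x1: joint(x1, x2), 0.0, 1.0)

def factorizes(joint, grid, tol=1e-9):
    """Check f(x1, x2) = f(x1) f(x2) at every pair of grid points."""
    return all(
        abs(joint(x1, x2) - marginal_x1(joint, x1) * marginal_x2(joint, x2)) < tol
        for x1 in grid for x2 in grid
    )

def independent_density(x1, x2):
    """Hypothetical joint density 4 * x1 * x2 on the unit square."""
    return 4.0 * x1 * x2

def dependent_density(x1, x2):
    """Hypothetical joint density x1 + x2 on the unit square."""
    return x1 + x2

grid = [0.1, 0.5, 0.9]
print(factorizes(independent_density, grid))
print(factorizes(dependent_density, grid))
```

A grid check of this kind can only reject factorization, not prove it; for these two densities the conclusion can be confirmed by hand, since the marginals are 2x for the first density and x + 1/2 for the second.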

Figure 1.3 Conditional probability


Conditional Probabilities 


Suppose that A and B are any two events. Then the probability of event A 
conditional on B, or given B, is denoted as $\Pr(A \mid B)$ and is defined implicitly
by the equation

$$\Pr(A \cap B) = \Pr(B)\Pr(A \mid B). \tag{1.14}$$


For this equation to make sense as a definition of $\Pr(A \mid B)$, it is necessary that
$\Pr(B) \ne 0$. The idea underlying the definition is that, if we know somehow
that the event B has been realized, this knowledge can provide information
about whether event A has also been realized. For instance, if A and B are
disjoint, and B is realized, then it is certain that A has not been. As we
would wish, this does indeed follow from the definition (1.14), since $A \cap B$ is
the null set, of zero probability, if A and B are disjoint. Similarly, if B is a
subset of A, knowing that B has been realized means that A must have been
realized as well. Since in this case $\Pr(A \cap B) = \Pr(B)$, (1.14) tells us that
$\Pr(A \mid B) = 1$, as required.
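
The definition (1.14) and its two special cases can be checked on a finite set of equally likely outcomes. The sketch below uses two fair dice; the dice, the particular events, and the function names are assumptions chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = set(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event, given as a set of outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

def pr_given(a, b):
    """Pr(A | B) solved from (1.14); requires Pr(B) to be nonzero."""
    if pr(b) == 0:
        raise ValueError("Pr(B) must be nonzero")
    return pr(a & b) / pr(b)

A = {o for o in outcomes if sum(o) == 8}   # the two dice sum to 8
B = {o for o in outcomes if o[0] == 3}     # the first die shows 3
C = {o for o in outcomes if o[0] == 1}     # disjoint from A
D = {(4, 4)}                               # a subset of A

print(pr(A), pr_given(A, B), pr_given(A, C), pr_given(A, D))
```

Conditioning on C, which is disjoint from A, gives probability 0, while conditioning on D, a subset of A, gives probability 1, as the discussion of (1.14) requires.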


To gain a better understanding of (1.14), consider Figure 1.3. The bounding 
rectangle represents the full set of possibilities, and events A and B are sub- 
sets of the rectangle that overlap as shown. Suppose that the figure has been 
drawn in such a way that probabilities of subsets are proportional to their 
areas. Thus the probabilities of A and B are the ratios of the areas of the cor- 
responding circles to the area of the bounding rectangle, and the probability 
of the intersection $A \cap B$ is the ratio of its area to that of the rectangle.


Suppose now that it is known that B has been realized. This fact leads us
to redefine the probabilities so that everything outside B now has zero
probability, while, inside B, probabilities remain proportional to areas. Event B
will now have probability 1, in order to keep the total probability equal to 1.
Event A can be realized only if the realized point is in the intersection $A \cap B$,
since the set of all points of A outside this intersection have zero probability.
The probability of A, conditional on knowing that B has been realized, is thus
the ratio of the area of $A \cap B$ to that of B. This construction leads directly
to (1.14).


Figure 1.4 The CDF and PDF of the uniform distribution on [0, 1]


There are many ways to associate a random variable X with the rectangle 
shown in Figure 1.3. Such a random variable could be any function of the 
two coordinates that define a point in the rectangle. For example, it could be 
the horizontal coordinate of the point measured from the origin at the lower 
left-hand corner of the rectangle, or its vertical coordinate, or the Euclidean 
distance of the point from the origin. The realization of X is the value of the 
function it corresponds to at the realized point in the rectangle. 


For concreteness, let us assume that the function is simply the horizontal 
coordinate, and let the width of the rectangle be equal to 1. Then, since 
all values of the horizontal coordinate between 0 and 1 are equally probable, 
the random variable X has what is called the uniform distribution on the 
interval [0,1]. The CDF of this distribution is 


$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1. \end{cases}$$

Because F(x) is not differentiable at x = 0 and x = 1, the PDF of the 
uniform distribution does not exist at those points. Elsewhere, the derivative 
of F(x) is 0 outside [0,1] and 1 inside. The CDF and PDF are illustrated in 
Figure 1.4. This special case of the uniform distribution is often denoted the 
U(0,1) distribution. 

If the information were available that B had been realized, then the
distribution of X conditional on this information would be very different from the
U(0,1) distribution. Now only values between the extreme horizontal limits
of the circle of B are allowed. If one computes the area of the part of the
circle to the left of a given vertical line, then for each event $a = (X \le x)$ the
probability of this event conditional on B can be worked out. The result is
just the CDF of X conditional on the event B. Its derivative is the PDF of
X conditional on B. These are shown in Figure 1.5.


Figure 1.5 The CDF and PDF conditional on event B


The concept of conditional probability can be extended beyond probability 
conditional on an event to probability conditional on a random variable.
Suppose that $X_1$ is a r.v. and $X_2$ is a discrete r.v. with permitted values
$z_1, \ldots, z_m$. For each $i = 1, \ldots, m$, the CDF of $X_1$, and, if $X_1$ is
continuous, its PDF, can be computed conditional on the event $(X_2 = z_i)$. If
$X_2$ is also a continuous r.v., then things are a little more complicated, because
events like $(X_2 = x_2)$ for some real $x_2$ have zero probability, and so cannot
be conditioned on in the manner of (1.14).


On the other hand, it makes perfect intuitive sense to think of the distribution 
of $X_1$ conditional on some specific realized value of $X_2$. This conditional
distribution gives us the probabilities of events concerning $X_1$ when we know
that the realization of $X_2$ was actually $x_2$. We therefore make use of the
conditional density of $X_1$ for a given value $x_2$ of $X_2$. This conditional density,
or conditional PDF, is defined as


$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}. \tag{1.15}$$


Thus, for a given value $x_2$ of $X_2$, the conditional density is proportional to the
joint density of $X_1$ and $X_2$. Of course, (1.15) is well defined only if $f(x_2) > 0$.
In some cases, more sophisticated definitions can be found that would allow
$f(x_1 \mid x_2)$ to be defined for all $x_2$ even if $f(x_2) = 0$, but we will not need these
in this book. See, among others, Billingsley (1979). 

Conditional Expectations


Whenever we can describe the distribution of a random variable, X1, condi- 
tional on another, X2, either by a conditional CDF or a conditional PDF, 
we can consider the conditional expectation or conditional mean of $X_1$. If it
exists, this conditional expectation is just the ordinary expectation computed
using the conditional distribution. If $x_2$ is a possible value for $X_2$, then this
conditional expectation is written as $E(X_1 \mid x_2)$.


For a given value $x_2$, the conditional expectation $E(X_1 \mid x_2)$ is, like any other
ordinary expectation, a deterministic, that is, nonrandom, quantity. But we
can consider the expectation of $X_1$ conditional on every possible realization
of $X_2$. In this way, we can construct a new random variable, which we denote
by $E(X_1 \mid X_2)$, the realization of which is $E(X_1 \mid x_2)$ when the realization of
$X_2$ is $x_2$. We can call $E(X_1 \mid X_2)$ a deterministic function of the random
variable $X_2$, because the realization of $E(X_1 \mid X_2)$ is unambiguously determined
by the realization of $X_2$.

Conditional expectations defined as random variables in this way have a
number of interesting and useful properties. The first, called the Law of Iterated
Expectations, can be expressed as follows:


$$E\big(E(X_1 \mid X_2)\big) = E(X_1). \tag{1.16}$$


If a conditional expectation of $X_1$ can be treated as a random variable,
then the conditional expectation itself may have an expectation. According
to (1.16), this expectation is just the ordinary expectation of $X_1$.
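
The Law of Iterated Expectations can be verified exactly for a small discrete joint distribution. The distribution below is hypothetical and chosen only for illustration, as are the function names; the sketch computes the conditional expectation for each value of the conditioning variable, averages these using the marginal probabilities, and compares the result with the unconditional expectation.

```python
from fractions import Fraction

# Hypothetical joint distribution: keys are (x1, x2), values are probabilities.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}

def marginal_x2(joint):
    """Marginal probabilities of X2, obtained by summing over x1."""
    out = {}
    for (_, x2), p in joint.items():
        out[x2] = out.get(x2, 0) + p
    return out

def cond_expectation_x1(joint, x2):
    """E(X1 | X2 = x2), using probabilities conditional on X2 = x2."""
    p_x2 = marginal_x2(joint)[x2]
    return sum(p * x1 for (x1, v), p in joint.items() if v == x2) / p_x2

def expectation_x1(joint):
    """Unconditional expectation E(X1)."""
    return sum(p * x1 for (x1, _), p in joint.items())

def iterated_expectation(joint):
    """E(E(X1 | X2)): average the conditional means over X2."""
    return sum(p2 * cond_expectation_x1(joint, x2)
               for x2, p2 in marginal_x2(joint).items())

print(iterated_expectation(joint), expectation_x1(joint))
```

Here the conditional means differ across the two values of the conditioning variable, but their probability-weighted average equals the unconditional mean, as (1.16) states.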


Another property of conditional expectations is that any deterministic
function of a conditioning variable $X_2$ is its own conditional expectation. Thus,
for example, $E(X_2 \mid X_2) = X_2$, and $E(X_2^2 \mid X_2) = X_2^2$. Similarly, conditional
on $X_2$, the expectation of a product of another random variable $X_1$ and a
deterministic function of $X_2$ is the product of that deterministic function and
the expectation of $X_1$ conditional on $X_2$:


$$E\big(X_1 h(X_2) \mid X_2\big) = h(X_2)\,E(X_1 \mid X_2), \tag{1.17}$$


for any deterministic function h(-). An important special case of this, which 
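we will make use of in Section 1.5, arises when $E(X_1 \mid X_2) = 0$. In that case,
for any function $h(\cdot)$, $E(X_1 h(X_2)) = 0$, because


$$\begin{aligned} E\big(X_1 h(X_2)\big) &= E\big(E(X_1 h(X_2) \mid X_2)\big) \\ &= E\big(h(X_2)\,E(X_1 \mid X_2)\big) \\ &= E(0) = 0. \end{aligned}$$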


The first equality here follows from the Law of Iterated Expectations, (1.16). 
The second follows from (1.17). Since $E(X_1 \mid X_2) = 0$, the third line then
follows immediately. We will present other properties of conditional expectations
as the need arises. 
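
The special case just derived can be checked on a discrete example in which the conditional mean is zero even though the two variables are not independent. The joint distribution and the functions h below are hypothetical choices made for this sketch.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X1, X2); each listed pair has
# probability 1/4. The spread of X1 depends on X2, but its mean does not.
joint = {
    (-1, 1): Fraction(1, 4), (1, 1): Fraction(1, 4),
    (-3, 2): Fraction(1, 4), (3, 2): Fraction(1, 4),
}

def expectation(g, joint):
    """E(g(X1, X2)) for a discrete joint distribution."""
    return sum(p * g(x1, x2) for (x1, x2), p in joint.items())

def cond_mean_x1(joint, x2):
    """E(X1 | X2 = x2), computed from the joint probabilities."""
    p_x2 = sum(p for (_, v), p in joint.items() if v == x2)
    return sum(p * x1 for (x1, v), p in joint.items() if v == x2) / p_x2

# Arbitrary deterministic functions h, chosen only for illustration.
hs = [lambda x: x, lambda x: x ** 3 + 1, lambda x: Fraction(1, x)]

print([cond_mean_x1(joint, x2) for x2 in (1, 2)])
print([expectation(lambda x1, x2, h=h: x1 * h(x2), joint) for h in hs])
```

Every product expectation is zero because the conditional mean is zero for each value of the conditioning variable, which is the argument made in the three-line derivation above.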

1.3 The Specification of Regression Models


We now return our attention to the regression model (1.01) and revert to the 
notation of Section 1.1 in which $y_t$ and $X_t$ respectively denote the dependent
and independent variables. The model (1.01) can be interpreted as a model
for the mean of $y_t$ conditional on $X_t$. Let us assume that the error term $u_t$
has mean 0 conditional on $X_t$. Then, taking conditional expectations of both
sides of (1.01), we see that


$$E(y_t \mid X_t) = \beta_1 + \beta_2 X_t + E(u_t \mid X_t) = \beta_1 + \beta_2 X_t.$$


Without the key assumption that E(u; |X) = 0, the second equality here 
would not hold. As we pointed out in Section 1.1, it is impossible to make 
any sense of a regression model unless we make strong assumptions about 
the error terms. Of course, we could define u; as the difference between 
yz and E(y,|X;), which would give E(u, | X+) = 0 by definition. But if we 
require that E(u; | X+) = 0 and also specify (1.01), we must necessarily have 
E(u | Xt) = G1 + GoXe. 


As an example, suppose that we estimate the model (1.01) when in fact

$$y_t = \beta_1 + \beta_2 X_t + \beta_3 X_t^2 + v_t \tag{1.18}$$

with $\beta_3 \neq 0$ and an error term $v_t$ such that $E(v_t \mid X_t) = 0$. If the data were
generated by (1.18), the error term $u_t$ in (1.01) would be equal to $\beta_3 X_t^2 + v_t$.
By the results on conditional expectations in the last section, we see that

$$E(u_t \mid X_t) = E(\beta_3 X_t^2 + v_t \mid X_t) = \beta_3 X_t^2,$$

which we have assumed to be nonzero. This example shows the force of the
assumption that the error term has mean zero conditional on $X_t$. Unless the
mean of $y_t$ conditional on $X_t$ really is a linear function of $X_t$, the regression
function in (1.01) is not correctly specified, in the precise sense that (1.01)
cannot hold with an error term that has mean zero conditional on $X_t$. It will
become clear in later chapters that estimating incorrectly specified models
usually leads to results that are meaningless or, at best, seriously misleading.
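
A small simulation can make the misspecification example concrete. The sketch below rests on assumptions that are not in the text: the parameter values, the three possible values of the explanatory variable, and standard normal errors $v_t$ are all hypothetical. Data are generated from (1.18), the error term implied by (1.01) is computed, and its average is compared with $\beta_3 X_t^2$ for each value of $X_t$.

```python
import random

rng = random.Random(12345)              # fixed seed so the run is reproducible
beta1, beta2, beta3 = 1.0, 0.5, 0.25    # hypothetical parameter values
n = 30000

sums = {1.0: 0.0, 2.0: 0.0, 3.0: 0.0}
counts = {1.0: 0, 2.0: 0, 3.0: 0}
for _ in range(n):
    x = rng.choice([1.0, 2.0, 3.0])     # hypothetical explanatory variable
    v = rng.gauss(0.0, 1.0)             # error with E(v | X) = 0
    y = beta1 + beta2 * x + beta3 * x ** 2 + v   # data generated by (1.18)
    u = y - beta1 - beta2 * x           # error term implied by (1.01)
    sums[x] += u
    counts[x] += 1

# Average of u for each value of X, next to beta3 * X^2.
avg_u = {x: sums[x] / counts[x] for x in sums}
for x in sorted(avg_u):
    print(x, round(avg_u[x], 3), beta3 * x ** 2)
```

The averages differ systematically from zero and track $\beta_3 X_t^2$, which is what a nonzero conditional mean of the error term looks like in a sample.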


Information Sets 


In a more general setting, what we are interested in is usually not the mean
of $y_t$ conditional on a single explanatory variable $X_t$ but the mean of $y_t$
conditional on a set of potential explanatory variables. This set is often called
an information set, and it is denoted $\Omega_t$. Typically, the information set will
contain more variables than would actually be used in a regression model. For
example, it might consist of all the variables observed by the economic agents
whose actions determine $y_t$ at the time they make the decisions that cause
them to perform those actions. Such an information set could be very large.
As a consequence, much of the art of constructing, or specifying, a regression
model is deciding which of the variables that belong to $\Omega_t$ should be included
in the model and which of the variables should be excluded.


In some cases, economic theory makes it fairly clear what the information set
$\Omega_t$ should consist of, and sometimes also which variables in $\Omega_t$ should make
their way into a regression model. In many others, however, it may not be
at all clear how to specify $\Omega_t$. In general, we want to condition on exogenous
variables but not on endogenous ones. These terms refer to the origin or
genesis of the variables: An exogenous variable has its origins outside the
model under consideration, while the mechanism generating an endogenous
variable is inside the model. When we write a single equation like (1.01), the
only endogenous variable allowed is the dependent variable, $y_t$.


Recall the example of the consumption function that we looked at in Sec- 
tion 1.1. That model seeks to explain household consumption in terms of 
disposable income, but it makes no claim to explain disposable income, which 
is simply taken as given. The consumption function model can be correctly 
specified only if two conditions hold: 


(i) The mean of consumption conditional on disposable income is a linear 
function of the latter. 


(ii) Consumption is not a variable that contributes to the determination of 
disposable income. 


The second condition means that the origin of disposable income, that is, the 
mechanism by which disposable income is generated, lies outside the model for 
consumption. In other words, disposable income is exogenous in that model. 
If the simple consumption model we have presented is correctly specified, the 
two conditions above must be satisfied. Needless to say, we do not claim that 
this model is in fact correctly specified. 


It is not always easy to decide just what information set to condition on. As 
the above example shows, it is often not clear whether or not a variable is 
exogenous. This sort of question will be discussed in Chapter 8. Moreover, 
even if a variable clearly is exogenous, we may not want to include it in $\Omega_t$.
For example, if the ultimate purpose of estimating a regression model is to 
use it for forecasting, there may be no point in conditioning on information 
that will not be available at the time the forecast is to be made. 


Error Terms 


Whenever we specify a regression model, it is essential to make assumptions 
about the properties of the error terms. The simplest assumption is that all 
of the error terms have mean 0, come from the same distribution, and are 
independent of each other. Although this is a rather strong assumption, it is 
very commonly made in practice. 


Mutual independence of the error terms, when coupled with the assumption
that $E(u_t) = 0$, implies that the mean of $u_t$ is 0 conditional on all of the other
error terms $u_s$, $s \neq t$. However, the implication does not work in the other
direction, because the assumption of mutual independence is stronger than the
assumption about the conditional means. A very strong assumption which
is often made is that the error terms are independently and identically
distributed, or IID. According to this assumption, the error terms are mutually
independent, and they are in addition realizations from the same, identical,
probability distribution.


When the successive observations are ordered by time, it often seems plausible 
that an error term will be correlated with neighboring error terms. Thus $u_t$
might well be correlated with $u_s$ when the value of $|t - s|$ is small. This could
occur, for example, if there is correlation across time periods of random factors 
that influence the dependent variable but are not explicitly accounted for in 
the regression function. This phenomenon is called serial correlation, and it 
often appears to be observed in practice. When there is serial correlation, the 
error terms cannot be IID because they are not independent. 


Another possibility is that the variance of the error terms may be systemat- 
ically larger for some observations than for others. This will happen if the 
conditional variance of $y_t$ depends on some of the same variables as the
conditional mean. This phenomenon is called heteroskedasticity, and it too is often
observed in practice. For example, in the case of the consumption function, the 
variance of consumption may well be higher for households with high incomes 
than for households with low incomes. When there is heteroskedasticity, the 
error terms cannot be IID, because they are not identically distributed. It is 
perfectly possible to take explicit account of both serial correlation and het- 
eroskedasticity, but doing so would take us outside the context of regression 
models like (1.01). 


It may sometimes be desirable to write a regression model like the one we 
have been studying as 


$$E(y_t \mid \Omega_t) = \beta_1 + \beta_2 X_t, \tag{1.19}$$


in order to stress the fact that this is a model for the mean of $y_t$ conditional
on a certain information set. However, by itself, (1.19) is just as incomplete 
a specification as (1.01). In order to see this point, we must now state what 
we mean by a complete specification of a regression model. Probably the 
best way to do this is to say that a complete specification of any econometric 
model is one that provides an unambiguous recipe for simulating the model 
on a computer. After all, if we can use the model to generate simulated data, 
it must be completely specified. 


Simulating Econometric Models 


Consider equation (1.01). When we say that we simulate this model, we
mean that we generate numbers for the dependent variable, $y_t$, according
to equation (1.01). Obviously, one of the first things we must fix for the
simulation is the sample size, $n$. That done, we can generate each of the $y_t$,
$t = 1, \ldots, n$, by evaluating the right-hand side of the equation $n$ times. For
this to be possible, we need to know the value of each variable or parameter
that appears on the right-hand side.

Copyright © 1999, Russell Davidson and James G. MacKinnon


If we suppose that the explanatory variable $X_t$ is exogenous, then we simply
take it as given. So if, in the context of the consumption function example, 
we had data on the disposable income of households in some country every 
year for a period of n years, we could just use those data. Our simulation 
would then be specific to the country in question and to the time period of 
the data. Alternatively, it could be that we or some other econometricians 
had previously specified another model, for the explanatory variable this time, 
and we could then use simulated data provided by that model. 


Besides the explanatory variable, the other elements of the right-hand side of
(1.01) are the parameters, $\beta_1$ and $\beta_2$, and the error term $u_t$. The key feature
of the parameters is that we do not know their true values. We will have 
more to say about this point in Chapter 3, when we define the twin concepts 
of models and data-generating processes. However, for purposes of simulation, 
we could use either values suggested by economic theory or values obtained 
by estimating the model. Evidently, the simulation results will depend on 
precisely what values we use. 


Unlike the parameters, the error terms cannot be taken as given; instead, we 
wish to treat them as random. Luckily, it is easy to use a computer to generate 
“random” numbers by using a program called a random number generator; we 
will discuss these programs in Chapter 4. The “random” numbers generated 
by computers are not random according to some meanings of the word. For 
instance, a computer can be made to spit out exactly the same sequence of 
supposedly random numbers more than once. In addition, a digital computer 
is a perfectly deterministic device. Therefore, if random means the opposite 
of deterministic, only computers that are not functioning properly would be 
capable of generating truly random numbers. Because of this, some people 
prefer to speak of computer-generated random numbers as pseudo-random. 
However, for the purposes of simulations, the numbers computers provide have 
all the properties of random numbers that we need, and so we will call them 
simply random rather than pseudo-random. 


Computer-generated random numbers are mutually independent drawings, 
or realizations, from specific probability distributions, usually the uniform 
U(0,1) distribution or the standard normal distribution, both of which were 
defined in Section 1.2. Of course, techniques exist for generating drawings 
from many other distributions as well, as do techniques for generating draw- 
ings that are not independent. For the moment, the essential point is that we 
must always specify the probability distribution of the random numbers we 
use in a simulation. It is important to note that specifying the expectation of 
a distribution, or even the expectation conditional on some other variables, is 
not enough to specify the distribution in full. 

Let us now summarize the various steps in performing a simulation by giving
a sort of generic recipe for simulations of regression models. In the model
specification, it is convenient to distinguish between the deterministic
specification and the stochastic specification. In model (1.01), the deterministic
specification consists of the regression function, of which the ingredients are
the explanatory variable and the parameters. The stochastic specification
(“stochastic” is another word for “random”) consists of the probability
distribution of the error terms, and the requirement that the error terms should be
IID drawings from this distribution. Then, in order to simulate the dependent
variable $y_t$ in (1.01), we do as follows:


• Fix the sample size, $n$;
• Choose the parameters (here $\beta_1$ and $\beta_2$) of the deterministic specification;
• Obtain the $n$ successive values $X_t$, $t = 1, \ldots, n$, of the explanatory
  variable. As explained above, these values may be real-world data or the
  output of another simulation;
• Evaluate the $n$ successive values of the regression function $\beta_1 + \beta_2 X_t$, for
  $t = 1, \ldots, n$;
• Choose the probability distribution of the error terms, if necessary
  specifying parameters such as its mean and variance;
• Use a random-number generator to generate the $n$ successive and mutually
  independent values $u_t$ of the error terms;
• Form the $n$ successive values $y_t$ of the dependent variable by adding the
  error terms to the values of the regression function.


The $n$ values $y_t$, $t = 1, \ldots, n$, thus generated are the output of the simulation;
they are the simulated values of the dependent variable. 
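
The recipe for simulating (1.01) can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `simulate_regression`, the parameter values, the explanatory variable, and the choice of a normal distribution with standard deviation `sigma` for the error terms are all hypothetical.

```python
import random

def simulate_regression(x, beta1, beta2, sigma, seed=None):
    """Simulate y_t = beta1 + beta2 * x_t + u_t with IID N(0, sigma^2) errors.

    The normal error distribution is an assumption made for this sketch;
    any fully specified distribution could be substituted.
    """
    rng = random.Random(seed)
    n = len(x)                                      # sample size
    fitted = [beta1 + beta2 * xt for xt in x]       # regression function
    u = [rng.gauss(0.0, sigma) for _ in range(n)]   # independent error terms
    y = [f + ut for f, ut in zip(fitted, u)]        # dependent variable
    return y, u

# Hypothetical explanatory variable and parameter values.
x = [float(t) for t in range(1, 11)]
y, u = simulate_regression(x, beta1=1.0, beta2=0.8, sigma=0.5, seed=42)
print(y)
```

Fixing the seed reproduces exactly the same simulated sample, which illustrates the earlier remark that a computer can be made to produce the same sequence of random numbers more than once.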


The chief interest of such a simulation is that, if the model we simulate is 
correctly specified and thus reflects the real-world generating process for the 
dependent variable, our simulation mimics the real world accurately, because 
it makes use of the same data-generating mechanism as that in operation in 
the real world. 


A complete specification, then, is anything that leads unambiguously to a 
recipe like the one given above. We will define a fully specified parametric 
model as a model for which it is possible to simulate the dependent variable 
once the values of the parameters are known. A partially specified parametric 
model is one for which more information, over and above the parameter values, 
must be supplied before simulation is possible. Both sorts of models are 
frequently encountered in econometrics. 


To conclude this discussion of simulations, let us return to the specifications
(1.01) and (1.19). Both are obviously incomplete as they stand. In order
to complete either one, it is necessary to specify the information set $\Omega_t$ and
the distribution of $u_t$ conditional on $\Omega_t$. In particular, it is necessary to
know whether the error terms $u_s$ with $s \neq t$ belong to $\Omega_t$. In (1.19), one
aspect of the conditional distribution is given, namely, the conditional mean.
Unfortunately, because (1.19) contains no explicit error term, it is easy to
forget that it is there. Perhaps as a result, it is more common to write
regression models in the form of (1.01) than in the form of (1.19). However,
writing a model in the form of (1.01) does have the disadvantage that it
obscures both the dependence of the model on the choice of an information
set and the fact that the distribution of the error term must be specified
conditional on that information set.


Linear and Nonlinear Regression Models 


The simple linear regression model (1.01) is by no means the only reasonable
model for the mean of $y_t$ conditional on $X_t$. Consider, for example, the models

$$y_t = \beta_1 + \beta_2 X_t + \beta_3 X_t^2 + u_t, \tag{1.20}$$
$$y_t = \gamma_1 + \gamma_2 \log X_t + u_t, \quad \text{and} \tag{1.21}$$
$$y_t = \delta_1 + \delta_2 \frac{1}{X_t} + u_t. \tag{1.22}$$

These are all models that might be plausible in some circumstances.³ In
equation (1.20), there is an extra parameter, $\beta_3$, which allows $E(y_t \mid X_t)$ to
vary quadratically with $X_t$ whenever $\beta_3$ is nonzero. In effect, $X_t$ and $X_t^2$
are being treated as separate explanatory variables. Thus (1.20) is the first
example we have seen of a multiple linear regression model. It reduces to the
simple linear regression model (1.01) when $\beta_3 = 0$.


In the models (1.21) and (1.22), on the other hand, there are no extra
parameters. Instead, a nonlinear transformation of $X_t$ is used in place of $X_t$ itself.
As a consequence, the relationship between $X_t$ and $E(y_t \mid X_t)$ in these two
models is necessarily nonlinear. Nevertheless, (1.20), (1.21), and (1.22) are all
said to be linear regression models, because, even though the mean of $y_t$ may
depend nonlinearly on $X_t$, it always depends linearly on the unknown
parameters of the regression function. As we will see in Section 1.5, it is quite easy
to estimate a linear regression model. In contrast, genuinely nonlinear
models, in which the regression function depends nonlinearly on the parameters,
are somewhat harder to estimate; see Chapter 6.
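
The sense in which these models are linear can be illustrated numerically. In the sketch below, each regression function is written as an inner product of a parameter vector with transformed regressors; the function names and parameter values are hypothetical. Linearity in the parameters means that a linear combination of two parameter vectors yields the same linear combination of the regression functions.

```python
import math

def regressors_120(x):
    """Regressors of (1.20): a constant, X_t, and X_t squared."""
    return [1.0, x, x ** 2]

def regressors_121(x):
    """Regressors of (1.21): a constant and log X_t."""
    return [1.0, math.log(x)]

def regressors_122(x):
    """Regressors of (1.22): a constant and 1 / X_t."""
    return [1.0, 1.0 / x]

def regression_function(params, regressors):
    """Inner product of the parameters with the (transformed) regressors."""
    return sum(p * z for p, z in zip(params, regressors))

x = 4.0
theta_a = [1.0, 0.5, -0.1]   # hypothetical parameter values
theta_b = [0.2, -0.3, 0.4]
combo = [2.0 * a + 3.0 * b for a, b in zip(theta_a, theta_b)]

# Linearity in the parameters for (1.20), although it is quadratic in X_t.
lhs = regression_function(combo, regressors_120(x))
rhs = (2.0 * regression_function(theta_a, regressors_120(x))
       + 3.0 * regression_function(theta_b, regressors_120(x)))
print(lhs, rhs)
```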


³ In this book, all logarithms are natural logarithms. Thus $a = \log x$ implies
that $x = e^a$. Some authors use “ln” to denote natural logarithms and “log” to
denote base 10 logarithms. Since econometricians should never have any use
for base 10 logarithms, we avoid this aesthetically displeasing notation.

Because it is very easy to estimate linear regression models, a great deal
of applied work in econometrics makes use of them. It may seem that the
linearity assumption is very restrictive. However, as the examples (1.20),
(1.21), and (1.22) illustrate, this assumption need not be unduly restrictive
in practice, at least not if the econometrician is at all creative. If we are
willing to transform the dependent variable as well as the independent ones,
the linearity assumption can be made even less restrictive. As an example,
consider the nonlinear regression model


$$y_t = e^{\beta_1} X_{t2}^{\beta_2} X_{t3}^{\beta_3} + u_t, \tag{1.23}$$


in which there are two explanatory variables, $X_{t2}$ and $X_{t3}$, and the regression
function is multiplicative. If the notation seems odd, suppose that there is
implicitly a third explanatory variable, $X_{t1}$, which is constant and always
equal to $e$. Notice that the regression function in (1.23) can be evaluated only
when $X_{t2}$ and $X_{t3}$ are positive for all $t$. It is a genuinely nonlinear regression
function, since it is clearly linear neither in parameters nor in variables. For
reasons that will shortly become apparent, a nonlinear model like (1.23) is
very rarely estimated in practice.


A model like (1.23) is not as outlandish as may appear at first glance. It
could arise, for instance, if we wanted to estimate a Cobb-Douglas production
function. In that case, $y_t$ would be output for observation $t$, and $X_{t2}$ and $X_{t3}$
would be inputs, say labor and capital. Since $e^{\beta_1}$ is just a positive constant,
it plays the role of the scale factor that is present in every Cobb-Douglas
production function.


As (1.23) is written, everything enters multiplicatively except the error term. 
But it is easy to modify (1.23) so that the error term also enters multiplica- 
tively. One way to do this is to write 


$$y_t = e^{\beta_1} X_{t2}^{\beta_2} X_{t3}^{\beta_3} + u_t = \bigl(e^{\beta_1} X_{t2}^{\beta_2} X_{t3}^{\beta_3}\bigr)(1 + v_t), \tag{1.24}$$


where the error factor $1 + v_t$ multiplies the regression function. If we now
assume that the underlying errors $v_t$ are IID, it follows that the additive
errors $u_t$ are proportional to the regression function. This may well be a more
plausible specification than that in which the $u_t$ are supposed to be IID, as
was implicitly assumed in (1.23). To see this, notice first that the additive
error $u_t$ has the same units of measurement as $y_t$. If (1.23) is interpreted as
a production function, then $u_t$ is measured in units of output. However, the
multiplicative error $v_t$ is dimensionless. In other words, it is a pure number,
like 0.02, which could be expressed as 2 percent. If the $u_t$ are assumed to be
IID, then we are assuming that the error in output is of the same order of
magnitude regardless of the scale of production. If, on the other hand, the $v_t$
are assumed to be IID, then the error is proportional to total output. This
second assumption is almost always more reasonable than the first.


If the model (1.24) is a good one, the $v_t$ should be quite small, usually less than
about 0.05. For small values of the argument $w$, a standard approximation to
the exponential function gives us that $e^w \approx 1 + w$. As a consequence, (1.24)
will be very similar to the model

$$y_t = e^{\beta_1} X_{t2}^{\beta_2} X_{t3}^{\beta_3} e^{v_t}, \tag{1.25}$$

whenever the error terms are reasonably small.
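
How close the approximation $e^w \approx 1 + w$ is for errors of the size mentioned above can be checked directly. The helper name `approximation_gap` and the particular values of $w$ are illustrative choices only.

```python
import math

def approximation_gap(w):
    """Difference between exp(w) and its first-order approximation 1 + w."""
    return math.exp(w) - (1.0 + w)

for w in (0.01, 0.02, 0.05):
    print(f"w = {w:.2f}   exp(w) = {math.exp(w):.6f}   gap = {approximation_gap(w):.6f}")
```

The gap is roughly $w^2/2$, so it shrinks quickly as the errors become smaller, which is why the multiplicative factors $1 + v_t$ and $e^{v_t}$ are nearly interchangeable for small $v_t$.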

Now suppose we take logarithms of both sides of (1.25). The result is

$$\log y_t = \beta_1 + \beta_2 \log X_{t2} + \beta_3 \log X_{t3} + v_t, \tag{1.26}$$


which is a loglinear regression model. This model is linear in the parameters
and in the logarithms of all the variables, and so it is very much easier to
estimate than the nonlinear model (1.23). Since (1.25) is at least as plausible as
(1.23), it is not surprising that loglinear regression models, like (1.26), are
estimated very frequently in practice, while multiplicative models with additive
error terms, like (1.23), are very rarely estimated. Of course, it is important
to remember that (1.26) is not a model for the mean of $y_t$ conditional on $X_{t2}$
and $X_{t3}$. Instead, it is a model for the mean of $\log y_t$ conditional on those
variables. If it is really the conditional mean of $y_t$ that we are interested in,
we will not want to estimate a loglinear model like (1.26).
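
The step from the multiplicative model (1.25) to the loglinear model (1.26) can be verified numerically. In this sketch the parameter values, the input ranges, and the normal distribution of the multiplicative errors are hypothetical; the point is only that the logarithm of the simulated output differs from the loglinear regression function by exactly the error $v_t$.

```python
import math
import random

rng = random.Random(7)                 # fixed seed for reproducibility
beta1, beta2, beta3 = 0.5, 0.6, 0.3    # hypothetical parameter values
n = 5

x2 = [rng.uniform(1.0, 10.0) for _ in range(n)]   # positive inputs
x3 = [rng.uniform(1.0, 10.0) for _ in range(n)]
v = [rng.gauss(0.0, 0.02) for _ in range(n)]      # small multiplicative errors

# Output generated by the multiplicative model (1.25).
y = [math.exp(beta1) * a ** beta2 * b ** beta3 * math.exp(e)
     for a, b, e in zip(x2, x3, v)]

# Regression function of the loglinear model (1.26).
log_fit = [beta1 + beta2 * math.log(a) + beta3 * math.log(b)
           for a, b in zip(x2, x3)]

# log y_t minus the loglinear regression function recovers v_t.
recovered_v = [math.log(yt) - f for yt, f in zip(y, log_fit)]
print(recovered_v)
print(v)
```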


1.4 Matrix Algebra 


It is impossible to study econometrics beyond the most elementary level with- 
out using matrix algebra. Most readers are probably already quite familiar 
with matrix algebra. This section reviews some basic results that will be used 
throughout the book. It also shows how regression models can be written very 
compactly using matrix notation. More advanced material will be discussed 
in later chapters, as it is needed. 


An $n \times m$ matrix $A$ is a rectangular array that consists of $nm$ elements
arranged in $n$ rows and $m$ columns. The name of the matrix is conventionally
shown in boldface. A typical element of $A$ might be denoted by either $A_{ij}$ or
$a_{ij}$, where $i = 1, \ldots, n$ and $j = 1, \ldots, m$. The first subscript always indicates
the row, and the second always indicates the column. It is sometimes necessary
to show the elements of a matrix explicitly, in which case they are arrayed in
rows and columns and surrounded by large brackets, as in


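$$B = \begin{bmatrix} 2 & 3 & 6 \\ \ast & \ast & \ast \end{bmatrix}.$$
[The entries of the second row, shown as $\ast$, are illegible in the source.]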


Here B is a 2 x 3 matrix. 


If a matrix has only one column or only one row, it is called a vector. There are 
two types of vectors, column vectors and row vectors. Since column vectors 
are more common than row vectors, a vector that is not specified to be a 
row vector is normally treated as a column vector. If a column vector has 
n elements, it may be referred to as an n-vector. Boldface is used to denote 
vectors as well as matrices. It is conventional to use uppercase letters for 
matrices and lowercase letters for column vectors. However, it is sometimes 
necessary to ignore this convention. 

If a matrix has the same number of columns and rows, it is said to be square.
A square matrix $A$ is symmetric if $A_{ij} = A_{ji}$ for all $i$ and $j$. Symmetric
matrices occur very frequently in econometrics. A square matrix is said to
be diagonal if $A_{ij} = 0$ for all $i \neq j$; in this case, the only nonzero entries are
those on what is called the principal diagonal. Sometimes a square matrix
has all zeros above or below the principal diagonal. Such a matrix is said to
be triangular. If the nonzero elements are all above the diagonal, it is said to
be upper-triangular; if the nonzero elements are all below the diagonal, it is
said to be lower-triangular. Here are some examples:


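$$A = \begin{bmatrix} 1 & 2 & 4 \\ 2 & 3 & 6 \\ 4 & 6 & 5 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 2 & 0 \\ 5 & 2 & 6 \end{bmatrix}.$$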


In this case, A is symmetric, B is diagonal, and C is lower-triangular. 


The transpose of a matrix is obtained by interchanging its row and column
subscripts. Thus the $ij$th element of $A$ becomes the $ji$th element of its
transpose, which is denoted $A^\top$. Note that many authors use $A'$ rather than $A^\top$ to
denote the transpose of $A$. The transpose of a symmetric matrix is equal to
the matrix itself. The transpose of a column vector is a row vector, and vice
versa. Here are some examples:


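$$A = \begin{bmatrix} 2 & 5 & 7 \\ 3 & 8 & 4 \end{bmatrix}, \qquad
A^\top = \begin{bmatrix} 2 & 3 \\ 5 & 8 \\ 7 & 4 \end{bmatrix}, \qquad
b = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}, \qquad
b^\top = \begin{bmatrix} 2 & 4 & 6 \end{bmatrix}.$$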


Note that a matrix $A$ is symmetric if and only if $A = A^\top$.
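
Transposition and the symmetry test are easy to express in code. The sketch below stores a matrix as a list of rows, which is one possible representation among many, and the helper names `transpose` and `is_symmetric` are illustrative.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def is_symmetric(m):
    """A matrix is symmetric if and only if it equals its own transpose."""
    return m == transpose(m)

rect = [[2, 5, 7],
        [3, 8, 4]]            # a 2 x 3 matrix
sym = [[1, 2, 4],
       [2, 3, 6],
       [4, 6, 5]]             # a symmetric 3 x 3 matrix
lower = [[1, 0, 0],
         [3, 2, 0],
         [5, 2, 6]]           # lower-triangular, not symmetric

print(transpose(rect))
print(is_symmetric(sym), is_symmetric(lower))
```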


Arithmetic Operations on Matrices 


Addition and subtraction of matrices work exactly the way they do for scalars,
with the proviso that matrices can be added or subtracted only if they are
conformable. In the case of addition and subtraction, this just means that
they must have the same dimensions, that is, the same number of rows and
the same number of columns. If $A$ and $B$ are conformable, then a typical
element of $A + B$ is simply $A_{ij} + B_{ij}$, and a typical element of $A - B$ is
$A_{ij} - B_{ij}$.

Matrix multiplication actually involves both additions and multiplications. It
is based on what is called the inner product, or scalar product, of two vectors.
Suppose that $a$ and $b$ are $n$-vectors. Then their inner product is

$$a^\top b = b^\top a = \sum_{i=1}^{n} a_i b_i.$$


As the name suggests, this is just a scalar. 

When two matrices are multiplied together, the $ij$th element of the result is
equal to the inner product of the $i$th row of the first matrix with the $j$th
column of the second matrix. Thus, if $C = AB$,

$$C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}. \tag{1.27}$$


For (1.27) to make sense, we must assume that A has m columns and that 
B has m rows. In general, if two matrices are to be conformable for multipli- 
cation, the first matrix must have as many columns as the second has rows. 
Further, as is clear from (1.27), the result has as many rows as the first matrix 
and as many columns as the second. One way to make this explicit is to write 
something like 
$$\underset{n \times m}{A} \; \underset{m \times l}{B} = \underset{n \times l}{C}.$$

One rarely sees this type of notation in a book or journal article. However, it 
is often useful to employ it when doing calculations, in order to verify that the 
matrices being multiplied are indeed conformable and to derive the dimensions 
of their product. 


The rules for multiplying matrices and vectors together are the same as the 
rules for multiplying matrices with each other; vectors are simply treated as 
matrices that have only one column or only one row. For instance, if we 
multiply an n-vector a by the transpose of an n-vector b, we obtain what is 
called the outer product of the two vectors. The result, written as $ab^\top$, is an
$n \times n$ matrix with typical element $a_i b_j$.


Matrix multiplication is, in general, not commutative. The fact that it is pos- 
sible to premultiply B by A does not imply that it is possible to postmultiply 
B by A. In fact, it is easy to see that both operations are possible if and only 
if one of the matrix products is square, in which case the other matrix product 
will be square also, although generally with different dimensions. Even when 
both operations are possible, $AB \neq BA$ except in special cases.
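
The inner product, the matrix product defined in (1.27), the outer product, and the failure of commutativity can all be illustrated with a few short functions. The list-of-rows representation, the function names, and the example matrices are illustrative choices rather than anything prescribed by the text.

```python
def inner(a, b):
    """Inner product of two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")
    return sum(x * y for x, y in zip(a, b))

def matmul(A, B):
    """Matrix product: element (i, j) is row i of A times column j of B."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("matrices are not conformable for multiplication")
    columns = list(zip(*B))
    return [[inner(row, col) for col in columns] for row in A]

def outer(a, b):
    """Outer product a b^T, with typical element a_i * b_j."""
    return [[ai * bj for bj in b] for ai in a]

A = [[2, 5, 7], [3, 8, 4]]          # 2 x 3
B = [[1, 0], [0, 1], [2, 3]]        # 3 x 2
print(matmul(A, B))                 # a 2 x 2 matrix
print(outer([1, 2], [3, 4]))

# Two square matrices for which both products exist but differ.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
print(matmul(P, Q), matmul(Q, P))
```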


A special matrix that econometricians frequently make use of is I, which 
denotes the identity matrix. It is a diagonal matrix with every diagonal 
element equal to 1. A subscript is sometimes used to indicate the number of 
rows and columns. Thus 


$$I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$


The identity matrix is so called because when it is either premultiplied or 
postmultiplied by any matrix, it leaves the latter unchanged. Thus, for any 
matrix A, AI = IA = A, provided, of course, that the matrices are con- 
formable for multiplication. It is easy to see why the identity matrix has this 
property. Recall that the only nonzero elements of I are equal to 1 and are 
on the principal diagonal. This fact can be expressed simply with the help of
the symbol known as the Kronecker delta, written as $\delta_{ij}$. The definition is

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases} \tag{1.28}$$

The $ij$th element of $I$ is just $\delta_{ij}$. By (1.27), the $ij$th element of $AI$ is

$$\sum_{k=1}^{m} A_{ik} I_{kj} = \sum_{k=1}^{m} A_{ik} \delta_{kj} = A_{ij},$$

since all the terms in the sum over $k$ vanish except that for which $k = j$.
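
The argument above translates directly into code: build the identity matrix from the Kronecker delta and confirm that multiplying by it leaves a matrix unchanged. The function names and the example matrix are hypothetical.

```python
def kronecker_delta(i, j):
    """1 if the two indices are equal, 0 otherwise."""
    return 1 if i == j else 0

def identity(n):
    """The n x n identity matrix, whose (i, j) element is delta_ij."""
    return [[kronecker_delta(i, j) for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Matrix product of two conformable matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in A]

A = [[2, 5, 7], [3, 8, 4]]           # a 2 x 3 example matrix
print(matmul(A, identity(3)))        # postmultiplying by I_3
print(matmul(identity(2), A))        # premultiplying by I_2
```

Note that the conformable identity matrix is $3 \times 3$ on the right of this $2 \times 3$ matrix but $2 \times 2$ on the left.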


A special vector that we frequently use in this book is $\iota$. It denotes a
column vector every element of which is 1. This special vector comes in handy
whenever one wishes to sum the elements of another vector, because, for any
$n$-vector $b$,

$$\iota^\top b = \sum_{i=1}^{n} b_i. \tag{1.29}$$
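
As a quick check of (1.29), the inner product of a vector of ones with another vector should equal the sum of that vector's elements. The helper names below are illustrative.

```python
def ones(n):
    """An n-vector every element of which is 1."""
    return [1] * n

def inner(a, b):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

b = [2, 4, 6]
total = inner(ones(len(b)), b)
print(total, sum(b))
```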


Matrix multiplication and matrix addition interact in an intuitive way. It 
is easy to check from the definitions of the respective operations that the 
distributive properties hold. That is, assuming that the dimensions of the 
matrices are conformable for the various operations, 


A(B +C) = AB + AC, and 
(B+ C)A = BA + CA. 


In addition, both operations are associative, which means that 
(A+ B)+C=A+(B+C), and 
(AB)C = A(BC). 


The transpose of the product of two matrices is the product of the transposes 
of the matrices with the order reversed. Thus 


$$(AB)^\top = B^\top A^\top. \tag{1.30}$$


The reversal of the order is necessary for the transposed matrices to be con- 
formable for multiplication. The result (1.30) can be proved immediately by 
writing out the typical entries of both sides and checking that 


$$(AB)^\top_{ij} = (AB)_{ji} = \sum_{k=1}^{m} A_{jk} B_{ki} = \sum_{k=1}^{m} (B^\top)_{ik} (A^\top)_{kj} = (B^\top A^\top)_{ij},$$


where $m$ is the number of columns of $A$ and the number of rows of $B$. It is
always possible to multiply a matrix by its own transpose: If $A$ is $n \times m$, then
$A^\top$ is $m \times n$, $A^\top A$ is $m \times m$, and $AA^\top$ is $n \times n$. It follows directly from (1.30)
that both of these matrix products are symmetric:

$$A^\top A = (A^\top A)^\top \quad \text{and} \quad AA^\top = (AA^\top)^\top.$$
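
Both results can be confirmed on a small example. The matrices below are arbitrary illustrative choices, and the helpers are a minimal sketch using lists of rows.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(A, B):
    """Matrix product of two conformable matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in A]

A = [[2, 5, 7], [3, 8, 4]]          # 2 x 3
B = [[1, 0], [0, 1], [2, 3]]        # 3 x 2

# (AB)^T compared with B^T A^T, as in (1.30).
lhs = transpose(matmul(A, B))
rhs = matmul(transpose(B), transpose(A))

# A^T A is 3 x 3 and A A^T is 2 x 2; both should be symmetric.
gram = matmul(transpose(A), A)
outer_gram = matmul(A, transpose(A))

print(lhs, rhs)
print(gram == transpose(gram), outer_gram == transpose(outer_gram))
```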


It is frequently necessary to multiply a matrix, say B, by a scalar, say a. 
Multiplication by a scalar works exactly the way one would expect: Every 
element of B is multiplied by a. Since multiplication by a scalar is commuta- 
tive, we can write this either as aB or as Ba, but aB is the more common 
notation. 


Occasionally, it is necessary to multiply two matrices together element by
element. The result is called the direct product of the two matrices. The
direct product of $A$ and $B$ is denoted $A * B$, and a typical element of it is
equal to $A_{ij} B_{ij}$.


A square matrix may or may not be invertible. If $A$ is invertible, then it has
an inverse matrix $A^{-1}$ with the property that

$$AA^{-1} = A^{-1}A = I.$$

If $A$ is symmetric, then so is $A^{-1}$. If $A$ is triangular, then so is $A^{-1}$. Except
in certain special cases, it is not easy to calculate the inverse of a matrix by
hand. One such special case is that of a diagonal matrix, say $D$, with typical
diagonal element $D_{ii}$. It is easy to verify that $D^{-1}$ is also a diagonal matrix,
with typical diagonal element $D_{ii}^{-1}$.
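
The diagonal case is simple enough to code directly. The sketch below inverts a diagonal matrix by taking reciprocals of its diagonal elements and checks that the product with the original matrix is the identity; exact fractions are used to avoid rounding, and the function names and the example matrix are hypothetical.

```python
from fractions import Fraction

def diagonal_inverse(D):
    """Invert a diagonal matrix by inverting each diagonal element."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            if i != j and D[i][j] != 0:
                raise ValueError("matrix is not diagonal")
        if D[i][i] == 0:
            raise ValueError("zero on the diagonal: matrix is singular")
    return [[Fraction(1) / D[i][i] if i == j else Fraction(0)
             for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Matrix product of two conformable matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in A]

D = [[2, 0, 0], [0, 4, 0], [0, 0, 5]]
D_inv = diagonal_inverse(D)
print(D_inv)
print(matmul(D, D_inv))
```

A zero on the diagonal makes the matrix singular, which is why the function refuses to invert it.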


If an n x n square matrix A is invertible, then its rank is n. Such a matrix is 
said to have full rank. If a square matrix does not have full rank, and therefore 
is not invertible, it is said to be singular. If a square matrix is singular, its 
rank must be less than its dimension. If, by omitting $j$ rows and $j$ columns
of $A$, we can obtain a matrix $A'$ that is invertible, and if $j$ is the smallest
number for which this is true, the rank of $A$ is $n - j$. More generally, for
matrices that are not necessarily square, the rank is the largest number m 
for which an m x m nonsingular matrix can be constructed by omitting some 
rows and some columns from the original matrix. The rank of a matrix is 
closely related to the geometry of vector spaces, which will be discussed in 
the next chapter. 


Regression Models and Matrix Notation 


The simple linear regression model (1.01) can easily be written in matrix 
notation. If we stack the model for all the observations, we obtain 


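$$\begin{aligned}
y_1 &= \beta_1 + \beta_2 X_1 + u_1 \\
y_2 &= \beta_1 + \beta_2 X_2 + u_2 \\
&\;\;\vdots \\
y_n &= \beta_1 + \beta_2 X_n + u_n.
\end{aligned} \tag{1.31}$$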
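
Let $y$ denote an $n$-vector with typical element $y_t$, $u$ an $n$-vector with typical
element $u_t$, $X$ an $n \times 2$ matrix that consists of a column of 1s and a column
with typical element $X_t$, and $\beta$ a 2-vector with typical element $\beta_i$, $i = 1, 2$.
Thus we have

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \quad \text{and} \quad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}.$$

Equations (1.31) can now be rewritten as

$$y = X\beta + u. \tag{1.32}$$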


It is easy to verify from the rules of matrix multiplication that a typical row
of (1.32) is a typical row of (1.31). When we postmultiply the matrix $\mathbf{X}$ by
the vector $\boldsymbol{\beta}$, we obtain a vector $\mathbf{X}\boldsymbol{\beta}$ with typical element $\beta_1 + \beta_2 X_t$.
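

The equivalence between the stacked equations and the matrix form can be
checked directly. The sketch below assumes NumPy; the data, the parameter
values, and the error terms are invented purely for illustration.

```python
import numpy as np

# Hypothetical data for n = 4 observations.
x = np.array([2.0, 3.0, 5.0, 7.0])      # observations X_t
beta = np.array([1.0, 0.5])             # (beta_1, beta_2), chosen arbitrarily
u = np.array([0.1, -0.2, 0.0, 0.3])     # error terms, fixed here rather than random

# X is n x 2: a column of 1s next to the column with typical element X_t.
X = np.column_stack([np.ones_like(x), x])

# Matrix form: y = X beta + u.
y = X @ beta + u

# Row by row: y_t = beta_1 + beta_2 X_t + u_t.
y_rowwise = np.array([beta[0] + beta[1] * x_t + u_t for x_t, u_t in zip(x, u)])

same = np.allclose(y, y_rowwise)
```

The flag `same` confirms that each element of the matrix expression agrees
with the corresponding scalar equation.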


When a regression model is written in the form (1.32), the separate columns 
of the matrix X are called regressors, and the column vector y is called 
the regressand. In (1.31), there are just two regressors, corresponding to 
the constant and one explanatory variable. One advantage of writing the 
regression model in the form (1.32) is that we are not restricted to just one 
or two regressors. Suppose that we have k regressors, one of which may or 
may not correspond to a constant, and the others to a number of explanatory 
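variables. Then the matrix $\mathbf{X}$ becomes


$$
\mathbf{X} = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k}\\
X_{21} & X_{22} & \cdots & X_{2k}\\
\vdots & \vdots & & \vdots\\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{bmatrix},
\tag{1.33}
$$


where $X_{ti}$ denotes the $t^{\text{th}}$ observation on the $i^{\text{th}}$ regressor, and the vector $\boldsymbol{\beta}$
now has $k$ elements, $\beta_1$ through $\beta_k$. Equation (1.32) remains perfectly valid
when $\mathbf{X}$ and $\boldsymbol{\beta}$ are redefined in this way. A typical row of this equation is


$$y_t = \mathbf{X}_t\boldsymbol{\beta} + u_t = \sum_{i=1}^{k} \beta_i X_{ti} + u_t, \tag{1.34}$$


where we have used $\mathbf{X}_t$ to denote the $t^{\text{th}}$ row of $\mathbf{X}$.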


In (1.32), we used the rules of matrix multiplication to write the regression 
function, for the entire sample, in a very simple form. These rules make it 
possible to find equally convenient expressions for other aspects of regression 
models. The key fact is that every element of the product of two matrices is a
summation. Thus it is often very convenient to use matrix algebra when
dealing with summations. Consider, for example, the matrix of sums of squares
and cross-products of the $\mathbf{X}$ matrix. This is a $k \times k$ symmetric matrix, of
which a typical element is either


$$\sum_{t=1}^{n} X_{ti}^2 \quad\text{or}\quad \sum_{t=1}^{n} X_{ti}X_{tj},$$


the former being a typical diagonal element and the latter a typical
off-diagonal one. This entire matrix can be written very compactly as $\mathbf{X}^\top\mathbf{X}$.
Similarly, the vector with typical element


$$\sum_{t=1}^{n} X_{ti}\,y_t$$


can be written as $\mathbf{X}^\top\mathbf{y}$. As we will see in the next section, the least squares
estimates of $\boldsymbol{\beta}$ depend only on the matrix $\mathbf{X}^\top\mathbf{X}$ and the vector $\mathbf{X}^\top\mathbf{y}$.
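

To make the link between matrix products and summations concrete, the sketch
below builds each element of the two products as an explicit sum over
observations and compares the results with the matrix expressions. It assumes
NumPy, and the numbers are hypothetical.

```python
import numpy as np

# Hypothetical n = 5, k = 3 regressor matrix; the first column is a constant.
X = np.array([[1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5],
              [1.0, 5.0, 1.0],
              [1.0, 7.0, 2.5],
              [1.0, 8.0, 2.0]])
y = np.array([3.0, 4.0, 6.0, 9.0, 10.0])
n, k = X.shape

# Element (i, j) of the cross-product matrix as a sum over t of X_ti * X_tj.
XtX_sums = np.array([[sum(X[t, i] * X[t, j] for t in range(n)) for j in range(k)]
                     for i in range(k)])

# Element i of the cross-product vector as a sum over t of X_ti * y_t.
Xty_sums = np.array([sum(X[t, i] * y[t] for t in range(n)) for i in range(k)])

# The same objects written compactly with matrix multiplication.
XtX = X.T @ X
Xty = X.T @ y
```

Because the first regressor is a constant, the upper left-hand element of
`XtX` is just the sample size, and the first element of `Xty` is the sum of
the `y` values.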


Partitioned Matrices 


There are many ways of writing an n x k matrix X that are intermediate 
between the straightforward notation X and the full element-by-element de- 
composition of X given in (1.33). We might wish to separate the columns 
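while grouping the rows, as


$$
\underset{n\times k}{\mathbf{X}} =
\begin{bmatrix}
\underset{n\times 1}{\mathbf{x}_1} & \underset{n\times 1}{\mathbf{x}_2} & \cdots & \underset{n\times 1}{\mathbf{x}_k}
\end{bmatrix},
$$


or we might wish to separate the rows but not the columns, as


$$
\underset{n\times k}{\mathbf{X}} =
\begin{bmatrix}
\mathbf{X}_1\\ \mathbf{X}_2\\ \vdots\\ \mathbf{X}_n
\end{bmatrix}
\begin{matrix}
1\times k\\ 1\times k\\ \vdots\\ 1\times k
\end{matrix}
$$


To save space, we can also write this as $\mathbf{X} = [\mathbf{X}_1 \;\vdots\; \mathbf{X}_2 \;\vdots\; \cdots \;\vdots\; \mathbf{X}_n]$. There is no
restriction on how a matrix can be partitioned, so long as all the submatrices
or blocks fit together correctly. Thus we might have


$$
\mathbf{X} =
\begin{bmatrix}
\overset{k_1}{\mathbf{X}_{11}} & \overset{k_2}{\mathbf{X}_{12}}\\
\mathbf{X}_{21} & \mathbf{X}_{22}
\end{bmatrix}
\begin{matrix}
n_1\\ n_2
\end{matrix}
$$


with the submatrix $\mathbf{X}_{11}$ of dimensions $n_1 \times k_1$, $\mathbf{X}_{12}$ of dimensions $n_1 \times k_2$,
$\mathbf{X}_{21}$ of dimensions $n_2 \times k_1$, and $\mathbf{X}_{22}$ of dimensions $n_2 \times k_2$, with $n_1 + n_2 = n$
and $k_1 + k_2 = k$. Thus $\mathbf{X}_{11}$ and $\mathbf{X}_{12}$ have the same number of rows, and
also $\mathbf{X}_{21}$ and $\mathbf{X}_{22}$, as required for the submatrices to fit together horizontally.
Similarly, $\mathbf{X}_{11}$ and $\mathbf{X}_{21}$ have the same number of columns, and also $\mathbf{X}_{12}$ and
$\mathbf{X}_{22}$, as required for the submatrices to fit together vertically as well.


If two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same dimensions are partitioned in exactly
the same way, they can be added or subtracted block by block. A simple
example is


$$
\mathbf{A} + \mathbf{B} = [\mathbf{A}_1 \;\; \mathbf{A}_2] + [\mathbf{B}_1 \;\; \mathbf{B}_2] = [\mathbf{A}_1 + \mathbf{B}_1 \;\;\; \mathbf{A}_2 + \mathbf{B}_2],
$$


where $\mathbf{A}_1$ and $\mathbf{B}_1$ have the same dimensions, as do $\mathbf{A}_2$ and $\mathbf{B}_2$.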


More interestingly, as we now explain, matrix multiplication can sometimes 
be performed block by block on partitioned matrices. If the product AB 
exists, then A has as many columns as B has rows. Now suppose that the 
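columns of $\mathbf{A}$ are partitioned in the same way as the rows of $\mathbf{B}$. Then


$$
\mathbf{AB} =
\begin{bmatrix}
\mathbf{A}_1 & \mathbf{A}_2 & \cdots & \mathbf{A}_p
\end{bmatrix}
\begin{bmatrix}
\mathbf{B}_1\\ \mathbf{B}_2\\ \vdots\\ \mathbf{B}_p
\end{bmatrix}.
$$


Here each $\mathbf{A}_i$, $i = 1, \ldots, p$, has as many columns as the corresponding $\mathbf{B}_i$
has rows. The product can be computed following the usual rules for matrix
multiplication just as though the blocks were scalars, yielding the result


$$\mathbf{AB} = \sum_{i=1}^{p} \mathbf{A}_i\mathbf{B}_i. \tag{1.35}$$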


To see this, it is enough to compute the typical element of each side of equation 
(1.35) directly and observe that they are the same. Matrix multiplication 
can also be performed block by block on matrices that are partitioned both 
horizontally and vertically, provided all the submatrices are conformable; see 
Exercise 1.17. 
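

Block-by-block multiplication can be illustrated numerically. The following
sketch assumes NumPy; the dimensions, the partition, and the random seed are
arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily
A = rng.normal(size=(4, 6))
B = rng.normal(size=(6, 3))

# Partition the 6 columns of A and the 6 rows of B in the same way: 2 + 3 + 1.
split_points = [2, 5]
A_blocks = np.split(A, split_points, axis=1)   # p = 3 column blocks of A
B_blocks = np.split(B, split_points, axis=0)   # p = 3 row blocks of B

# The product AB as the sum of the products of corresponding blocks.
AB_blockwise = sum(A_i @ B_i for A_i, B_i in zip(A_blocks, B_blocks))

matches = np.allclose(AB_blockwise, A @ B)
```

Each product `A_i @ B_i` is a full 4 x 3 matrix, so the sum is well defined
whatever the widths of the individual blocks.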


These results on multiplying partitioned matrices lead to a useful corollary. 
Suppose that we are interested only in the first m rows of a product AB, 
where A has more than m rows. Then we can partition the rows of A into 
two blocks, the first with m rows, the second with all the rest. We need not 
partition $\mathbf{B}$ at all. Then


$$
\mathbf{AB} =
\begin{bmatrix} \mathbf{A}_1\\ \mathbf{A}_2 \end{bmatrix}\mathbf{B} =
\begin{bmatrix} \mathbf{A}_1\mathbf{B}\\ \mathbf{A}_2\mathbf{B} \end{bmatrix}.
\tag{1.36}
$$


This works because $\mathbf{A}_1$ and $\mathbf{A}_2$ both have the full number of columns of $\mathbf{A}$,
which must be the same as the number of rows of $\mathbf{B}$, since $\mathbf{AB}$ exists. It
is clear from the rightmost expression in (1.36) that the first $m$ rows of $\mathbf{AB}$
are given by $\mathbf{A}_1\mathbf{B}$. In order to obtain any subset of the rows of a matrix
product of arbitrarily many factors, the rule is that we take the submatrix of
the leftmost factor that contains just the rows we want, and then multiply it
by all the other factors unchanged. Similarly, if we want to select a subset
of columns of a matrix product, we can just select them from the rightmost
factor, leaving all the factors to the left unchanged.
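

The rule for selecting rows or columns of a product can be checked with a
product of three factors. This is only a sketch, assuming NumPy, with random
matrices of arbitrarily chosen dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)      # seed chosen arbitrarily
A = rng.normal(size=(5, 4))
B = rng.normal(size=(4, 3))
C = rng.normal(size=(3, 6))

full_product = A @ B @ C

# First m rows of ABC: take the first m rows of the leftmost factor only.
m = 2
first_rows = A[:m] @ B @ C

# Last two columns of ABC: take them from the rightmost factor only.
last_cols = A @ B @ C[:, -2:]

rows_ok = np.allclose(first_rows, full_product[:m])
cols_ok = np.allclose(last_cols, full_product[:, -2:])
```

Selecting before multiplying gives the same numbers as multiplying and then
selecting, but it avoids computing the rows or columns that are not wanted.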


1.5 Method of Moments Estimation


Almost all econometric models contain unknown parameters. For most of the
uses to which such models can be put, it is necessary to have estimates of these
parameters. To compute parameter estimates, we need both a model
containing the parameters and a sample made up of observed data. If the model is
correctly specified, it describes the real-world mechanism which generated the
data in our sample.


It is common in statistics to speak of the “population” from which a sample 
is drawn. Recall the use of the term “population mean” as a synonym for 
the mathematical term “expectation”; see Section 1.2. The expression is a 
holdover from the time when statistics was biostatistics, and the object of 
study was the human population, usually that of a specific town or country, 
from which random samples were drawn by statisticians for study. The av- 
erage weight of all members of the population, for instance, would then be 
estimated by the mean of the weights of the individuals in the sample, that 
is, by the sample mean of individuals’ weights. The sample mean was thus an 
estimate of the population mean. The underlying idea is just that the sample 
represents the population from which it has been drawn. 


In econometrics, the use of the term population is simply a metaphor. A better 
concept is that of a data-generating process, or DGP. By this term, we mean 
whatever mechanism is at work in the real world of economic activity giving 
rise to the numbers in our samples, that is, precisely the mechanism that our 
econometric model is supposed to describe. A data-generating process is thus 
the analog in econometrics of a population in biostatistics. Samples may be 
drawn from a DGP just as they may be drawn from a population. In both 
cases, the samples are assumed to be representative of the DGP or population 
from which they are drawn. 


A very natural way to estimate parameters is to replace population means by 
sample means. This technique is called the method of moments, and it is one 
of the most widely-used estimation methods in statistics. As the name implies, 
it can be used with moments other than the mean. In general, the method 
of moments, sometimes called MM for short, estimates population moments 
by the corresponding sample moments. In order to apply this method to 
regression models, we must use the facts that population moments are expec- 
tations, and that regression models are specified in terms of the conditional 
expectations of the error terms. 


Estimating the Simple Linear Regression Model 


Let us now see how the principle of replacing population means by sample 
means works for the simple linear regression model (1.01). The error term for 
observation $t$ is


$$u_t = y_t - \beta_1 - \beta_2 X_t,$$


and, according to our model, the expectation of this error term is zero. Since
we have $n$ error terms for a sample of size $n$, we can consider the sample mean
of the error terms:


$$\frac{1}{n}\sum_{t=1}^{n} u_t = \frac{1}{n}\sum_{t=1}^{n} (y_t - \beta_1 - \beta_2 X_t). \tag{1.37}$$


We would like to set this sample mean equal to zero.


Suppose to begin with that $\beta_2 = 0$. This reduces the number of parameters
in the model to just one. In that case, there is just one value of $\beta_1$ which will
allow (1.37) to be zero. The equation defining this value is


$$\frac{1}{n}\sum_{t=1}^{n} (y_t - \hat\beta_1) = 0. \tag{1.38}$$


Since $\hat\beta_1$ is common to all the observations and thus does not depend on the
index $t$, (1.38) can be written as


$$\frac{1}{n}\sum_{t=1}^{n} y_t - \hat\beta_1 = 0.$$


We can easily solve this equation to obtain an estimate $\hat\beta_1$. This estimate is
just the mean of the observed values of the dependent variable,


$$\hat\beta_1 = \frac{1}{n}\sum_{t=1}^{n} y_t. \tag{1.39}$$


Thus, if we wish to estimate the population mean of the $y_t$, which is what
$\beta_1$ is in our model when $\beta_2 = 0$, the method of moments tells us to use the
sample mean as our estimate.


It is not obvious at first glance how to use the method of moments if we put
the second parameter $\beta_2$ back into the model. Equation (1.38) would become


$$\frac{1}{n}\sum_{t=1}^{n} (y_t - \hat\beta_1 - \hat\beta_2 X_t) = 0, \tag{1.40}$$


but this is just one equation, and there are two unknowns. In order to obtain
another equation, we can use the fact that our model specifies that the mean
of $u_t$ is 0 conditional on the explanatory variable $X_t$. Actually, it may well
specify that the mean of $u_t$ is 0 conditional on many other things as well,
depending on our choice of the information set $\Omega_t$, but we will ignore this for
now. The conditional mean assumption implies that not only is $E(u_t) = 0$,
but that $E(X_t u_t) = 0$ as well, since, by (1.16) and (1.17),


$$E(X_t u_t) = E\bigl(E(X_t u_t \mid X_t)\bigr) = E\bigl(X_t\,E(u_t \mid X_t)\bigr) = 0. \tag{1.41}$$


Thus we can supplement (1.40) by the following equation, which replaces the
population mean in (1.41) by the corresponding sample mean,


$$\frac{1}{n}\sum_{t=1}^{n} X_t\,(y_t - \hat\beta_1 - \hat\beta_2 X_t) = 0. \tag{1.42}$$


The equations (1.40) and (1.42) are two linear equations in two unknowns,
$\hat\beta_1$ and $\hat\beta_2$. Except in rare conditions, which can easily be ruled out, they
will have a unique solution that is not difficult to calculate. Solving these
equations yields the MM estimates.
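

As an illustration, the two sample moment conditions can be solved by simple
elimination. The sketch below uses only the Python standard library; the five
observations are invented, and the names `b1` and `b2` are our own labels for
the two MM estimates.

```python
# Hypothetical sample of n = 5 observations on X_t and y_t.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(x)

# Sample moments that appear in the two moment conditions.
mean_x = sum(x) / n
mean_y = sum(y) / n
mean_xx = sum(x_t * x_t for x_t in x) / n
mean_xy = sum(x_t * y_t for x_t, y_t in zip(x, y)) / n

# The two conditions can be rearranged as
#   b1 + mean_x * b2           = mean_y
#   mean_x * b1 + mean_xx * b2 = mean_xy.
# Subtracting mean_x times the first equation from the second eliminates b1.
denominator = mean_xx - mean_x ** 2     # zero only if every X_t is identical
b2 = (mean_xy - mean_x * mean_y) / denominator
b1 = mean_y - b2 * mean_x

# Both sample moment conditions should now hold up to rounding error.
residuals = [y_t - b1 - b2 * x_t for x_t, y_t in zip(x, y)]
moment_1 = sum(residuals) / n
moment_2 = sum(x_t * r_t for x_t, r_t in zip(x, residuals)) / n
```

The quantity `denominator` shows the rare condition that must be ruled out:
if all the $X_t$ were equal, the two equations would not have a unique solution.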


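We could just solve (1.40) and (1.42) directly, but it is far more illuminating
to rewrite them in matrix form. Since $\hat\beta_1$ and $\hat\beta_2$ do not depend on $t$, these
two equations can be written as


$$
\begin{aligned}
\hat\beta_1 + \Bigl(\frac{1}{n}\sum_{t=1}^{n} X_t\Bigr)\hat\beta_2 &= \frac{1}{n}\sum_{t=1}^{n} y_t\\
\Bigl(\frac{1}{n}\sum_{t=1}^{n} X_t\Bigr)\hat\beta_1 + \Bigl(\frac{1}{n}\sum_{t=1}^{n} X_t^2\Bigr)\hat\beta_2 &= \frac{1}{n}\sum_{t=1}^{n} X_t y_t.
\end{aligned}
$$


Multiplying both equations by $n$ and using the rules of matrix multiplication
that were discussed in the last section, we can also write them as


$$
\begin{bmatrix}
n & \sum_{t=1}^{n} X_t\\
\sum_{t=1}^{n} X_t & \sum_{t=1}^{n} X_t^2
\end{bmatrix}
\begin{bmatrix} \hat\beta_1\\ \hat\beta_2 \end{bmatrix}
=
\begin{bmatrix}
\sum_{t=1}^{n} y_t\\
\sum_{t=1}^{n} X_t y_t
\end{bmatrix}.
\tag{1.43}
$$


Equations (1.43) can be rewritten much more compactly. As we saw in the
last section, the model (1.01) is simply a special case of the multiple linear
regression model


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \tag{1.44}$$


where the $n$-vector $\mathbf{y}$ has typical element $y_t$, the $k$-vector $\boldsymbol{\beta}$ has typical
element $\beta_i$, and, in general, the matrix $\mathbf{X}$ is $n \times k$. In this case, $\mathbf{X}$ is $n \times 2$; it
can be written as $\mathbf{X} = [\boldsymbol{\iota} \;\; \mathbf{x}]$, where $\boldsymbol{\iota}$ denotes a column of 1s, and $\mathbf{x}$ denotes
a column with typical element $X_t$. Thus, recalling (1.29), we see that


$$
\mathbf{X}^\top\mathbf{y} =
\begin{bmatrix}
\sum_{t=1}^{n} y_t\\
\sum_{t=1}^{n} X_t y_t
\end{bmatrix}
\quad\text{and}\quad
\mathbf{X}^\top\mathbf{X} =
\begin{bmatrix}
n & \sum_{t=1}^{n} X_t\\
\sum_{t=1}^{n} X_t & \sum_{t=1}^{n} X_t^2
\end{bmatrix}.
$$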

These are the principal quantities that appear in the equations (1.43). Thus
it is clear that we can rewrite those equations as


$$\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}. \tag{1.45}$$


To find the estimator $\hat{\boldsymbol{\beta}}$ that solves (1.45), we simply premultiply both sides
by the inverse of the matrix $\mathbf{X}^\top\mathbf{X}$, assuming that this inverse exists. This
yields the famous formula


$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \tag{1.46}$$


The estimator $\hat{\boldsymbol{\beta}}$ given by this formula is generally called the ordinary least
squares, or OLS, estimator for the linear regression model.† Why it is called
this, rather than the MM estimator, will be explained shortly.
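

The formula is straightforward to evaluate numerically. The sketch below,
which assumes NumPy, simulates a sample from a simple linear regression with
parameter values chosen only for illustration, and computes the estimates in
three ways that should agree up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(42)     # seed chosen arbitrarily
n = 200
x = rng.uniform(0.0, 10.0, size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u               # simulated with beta_1 = 1 and beta_2 = 2

# X = [iota  x]: a column of 1s and the column of X_t.
X = np.column_stack([np.ones(n), x])

# Solve the linear system X'X b = X'y without forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The formula taken literally: b = (X'X)^{-1} X'y.
beta_hat_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# A library least squares routine, for comparison.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice, solving the linear system is usually preferred to inverting
$\mathbf{X}^\top\mathbf{X}$, since it does less work and tends to be more accurate
numerically; the explicit inverse is shown here only because it mirrors the formula.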


Estimating the Multiple Linear Regression Model 


The formula (1.46) gives us the OLS, and MM, estimator for the simple linear 
regression model (1.01), but in fact it does far more than that. As we now 
show, it also gives us the MM estimator for the multiple linear regression 
model (1.44). Since each of the explanatory variables is required to be in the 
information set $\Omega_t$, we have, for $i = 1, \ldots, k$,


$$E(X_{ti}\,u_t) = 0,$$


which, in the corresponding sample mean form, yields


$$\frac{1}{n}\sum_{t=1}^{n} X_{ti}\,(y_t - \mathbf{X}_t\hat{\boldsymbol{\beta}}) = 0. \tag{1.47}$$


(Recall from (1.34) that $\mathbf{X}_t$ denotes the $t^{\text{th}}$ row of $\mathbf{X}$.) As $i$ varies from 1
to $k$, equation (1.47) yields $k$ equations for the $k$ unknown components of $\hat{\boldsymbol{\beta}}$.
In most cases, there will be a constant, which we may take to be the first
regressor. If so, $X_{t1} = 1$, and the first of these equations simply says that the
sample mean of the error terms is 0.


In matrix form, after multiplying them by $n$, the $k$ equations of (1.47) can be
written as


$$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}. \tag{1.48}$$


The notation $\mathbf{0}$ is used to signify a zero vector, here a $k$-vector, each element
of which is zero. Equations (1.48) are clearly equivalent to equations (1.45).
Thus solving them yields the estimator (1.46), which applies no matter what
the number of regressors.
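

A quick numerical check of these moment conditions is sketched below, assuming
NumPy. The regressors and the dependent variable are random numbers with no
economic meaning; the conditions hold for any data, provided that
$\mathbf{X}^\top\mathbf{X}$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(7)      # seed chosen arbitrarily
n, k = 50, 3

# First regressor is a constant; the other k - 1 are random draws.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# The k sample moment conditions, multiplied by n and stacked as a k-vector.
moment_conditions = X.T @ residuals

# With a constant as the first regressor, the first condition says that
# the residuals have sample mean zero.
mean_residual = residuals.mean()
```

All $k$ elements of `moment_conditions` are zero up to rounding error, and so
is `mean_residual`.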


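It is easy to see that the OLS estimator (1.46) depends on $\mathbf{y}$ and $\mathbf{X}$
exclusively through a number of scalar products. Each column $\mathbf{x}_i$ of the matrix $\mathbf{X}$
corresponds to one of the regressors, as does each row $\mathbf{x}_i^\top$ of the transposed
matrix $\mathbf{X}^\top$. Thus we can write $\mathbf{X}^\top\mathbf{y}$ as


$$
\mathbf{X}^\top\mathbf{y} =
\begin{bmatrix}
\mathbf{x}_1^\top\\ \mathbf{x}_2^\top\\ \vdots\\ \mathbf{x}_k^\top
\end{bmatrix}\mathbf{y}
=
\begin{bmatrix}
\mathbf{x}_1^\top\mathbf{y}\\ \mathbf{x}_2^\top\mathbf{y}\\ \vdots\\ \mathbf{x}_k^\top\mathbf{y}
\end{bmatrix}.
$$


The elements of the rightmost expression here are just the scalar products of
the regressors $\mathbf{x}_i$ with the regressand $\mathbf{y}$. Similarly, we can write $\mathbf{X}^\top\mathbf{X}$ as


$$
\mathbf{X}^\top\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1^\top\\ \mathbf{x}_2^\top\\ \vdots\\ \mathbf{x}_k^\top
\end{bmatrix}
\begin{bmatrix}
\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_k
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_1^\top\mathbf{x}_1 & \mathbf{x}_1^\top\mathbf{x}_2 & \cdots & \mathbf{x}_1^\top\mathbf{x}_k\\
\mathbf{x}_2^\top\mathbf{x}_1 & \mathbf{x}_2^\top\mathbf{x}_2 & \cdots & \mathbf{x}_2^\top\mathbf{x}_k\\
\vdots & \vdots & & \vdots\\
\mathbf{x}_k^\top\mathbf{x}_1 & \mathbf{x}_k^\top\mathbf{x}_2 & \cdots & \mathbf{x}_k^\top\mathbf{x}_k
\end{bmatrix}.
$$


Once more, all the elements of the rightmost expression are scalar products of
pairs of regressors. Since $\mathbf{X}^\top\mathbf{X}$ can be expressed exclusively in terms of scalar
products of the variables of the regression, the same is true of its inverse, the
elements of which will be in general complicated functions of those scalar
products. Thus $\hat{\boldsymbol{\beta}}$ is a function solely of scalar products of pairs of variables.


† Econometricians generally make a distinction between an estimate, which is
simply a number used to estimate some parameter, normally based on a
particular data set, and an estimator, which is a rule, such as (1.46), for obtaining
estimates from any set of data.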


Least Squares Estimation 


We have derived the estimator (1.46) by using the method of moments. De- 
riving it in this way has at least two major advantages. Firstly, the method 
of moments is a very general and very powerful principle of estimation, one 
that we will encounter again and again throughout this book. Secondly, by 
using the method of moments, we were able to obtain (1.46) without making 
any use of calculus. However, as we have already remarked, (1.46) is generally 
referred to as the OLS estimator, not the MM estimator. It is interesting to 
see why this is so. 


For the multiple linear regression model (1.44), the expression $y_t - \mathbf{X}_t\boldsymbol{\beta}$ is
equal to the error term for the $t^{\text{th}}$ observation, but only if the correct value
of the parameter vector $\boldsymbol{\beta}$ is used. If the same expression is thought of as a
function of $\boldsymbol{\beta}$, with $\boldsymbol{\beta}$ allowed to vary arbitrarily, then it is called a residual,
more specifically, the residual associated with the $t^{\text{th}}$ observation. Similarly,
the $n$-vector $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$ is called the vector of residuals. The sum of the squares
of the components of the vector of residuals is called the sum of squared
residuals, or SSR. Since this sum is a scalar, the sum of squared residuals is
a scalar-valued function of the $k$-vector $\boldsymbol{\beta}$:


$$\mathrm{SSR}(\boldsymbol{\beta}) = \sum_{t=1}^{n} (y_t - \mathbf{X}_t\boldsymbol{\beta})^2. \tag{1.49}$$


The notation here emphasizes the fact that this function can be computed for
arbitrary values of the argument $\boldsymbol{\beta}$ purely in terms of the observed data $\mathbf{y}$
and $\mathbf{X}$.


The idea of least squares estimation is to minimize the sum of squared
residuals associated with a regression model. At this point, it may not be at all
clear why we would wish to do such a thing. However, it can be shown that
the parameter vector $\hat{\boldsymbol{\beta}}$ which minimizes (1.49) is the same as the MM
estimator (1.46). This being so, we will regularly use the traditional terminology
associated with linear regressions, based on least squares. Thus, the parameter
estimates which are the components of the vector $\hat{\boldsymbol{\beta}}$ that minimizes the SSR
(1.49) are called the least squares estimates, and the corresponding vector of
residuals is called the vector of least squares residuals. When least squares
is used to estimate a linear regression model like (1.01), it is called ordinary
least squares, or OLS, to distinguish it from other varieties of least squares
that we will encounter later, such as nonlinear least squares (Chapter 6) and
generalized least squares (Chapter 7).


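Consider briefly the simplest case of (1.01), in which $\beta_2 = 0$ and the model
contains only a constant term. Expression (1.49) becomes


$$\mathrm{SSR}(\beta_1) = \sum_{t=1}^{n} (y_t - \beta_1)^2 = \sum_{t=1}^{n} y_t^2 + n\beta_1^2 - 2\beta_1\sum_{t=1}^{n} y_t. \tag{1.50}$$


Differentiating the rightmost expression in (1.50) with respect to $\beta_1$ and
setting the derivative equal to zero gives the following first-order condition for a
minimum:


$$\frac{\partial\,\mathrm{SSR}}{\partial\beta_1} = 2\beta_1 n - 2\sum_{t=1}^{n} y_t = 0. \tag{1.51}$$


For this simple model, the matrix $\mathbf{X}$ consists solely of the constant vector, $\boldsymbol{\iota}$.
Therefore, by (1.29), $\mathbf{X}^\top\mathbf{X} = \boldsymbol{\iota}^\top\boldsymbol{\iota} = n$, and $\mathbf{X}^\top\mathbf{y} = \boldsymbol{\iota}^\top\mathbf{y} = \sum_{t=1}^{n} y_t$. Thus, if
the first-order condition (1.51) is multiplied by one-half, it can be rewritten
as $\boldsymbol{\iota}^\top\boldsymbol{\iota}\,\hat\beta_1 = \boldsymbol{\iota}^\top\mathbf{y}$, which is clearly just a special case of (1.45). Solving (1.51)
for $\beta_1$ yields the sample mean of the $y_t$,


$$\hat\beta_1 = \frac{1}{n}\sum_{t=1}^{n} y_t = (\boldsymbol{\iota}^\top\boldsymbol{\iota})^{-1}\boldsymbol{\iota}^\top\mathbf{y}. \tag{1.52}$$


We already saw, in (1.39), that this is the MM estimator for the model
with $\beta_2 = 0$. The rightmost expression in (1.52) makes it clear that the
sample mean is just a special case of the famous formula (1.46).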
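

The algebra for the constant-only model can be traced through with a few
numbers. The sketch below uses only the Python standard library; the four
observations and the function names are our own choices for illustration.

```python
# Hypothetical sample of n = 4 observations on y_t.
y = [3.0, 5.0, 4.0, 8.0]
n = len(y)

def ssr(b1):
    """Sum of squared residuals for the model with only a constant."""
    return sum((y_t - b1) ** 2 for y_t in y)

def ssr_expanded(b1):
    """The same function with the square multiplied out."""
    return sum(y_t ** 2 for y_t in y) + n * b1 ** 2 - 2.0 * b1 * sum(y)

def ssr_derivative(b1):
    """Derivative of the sum of squared residuals with respect to b1."""
    return 2.0 * b1 * n - 2.0 * sum(y)

# The first-order condition is solved by the sample mean.
b1_hat = sum(y) / n

# Values of the objective at, and slightly away from, the sample mean.
ssr_at_mean = ssr(b1_hat)
ssr_below = ssr(b1_hat - 0.1)
ssr_above = ssr(b1_hat + 0.1)
```

Evaluating `ssr_derivative(b1_hat)` gives zero, and moving `b1` in either
direction away from the sample mean increases the sum of squared residuals.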

Not surprisingly, the OLS and MM estimators are also equivalent in the
multiple linear regression model. For this model,


$$\mathrm{SSR}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{1.53}$$


If this inner product is written out in terms of the scalar components of $\mathbf{y}$, $\mathbf{X}$,
and $\boldsymbol{\beta}$, it is easy enough to show that the first-order conditions for minimizing
the SSR (1.53) can be written as (1.45); see Exercise 1.20. Thus we conclude
that (1.46) provides a general formula for the OLS estimator $\hat{\boldsymbol{\beta}}$ in the multiple
linear regression model.
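

The claim that the MM estimator minimizes the sum of squared residuals can be
illustrated, though of course not proved, by numerical experiment. The sketch
below assumes NumPy; the data are simulated with arbitrary parameter values,
and the gradient expression $-2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ is the standard result
obtained by differentiating the sum of squared residuals term by term.

```python
import numpy as np

rng = np.random.default_rng(3)      # seed chosen arbitrarily
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def ssr(beta):
    """Sum of squared residuals as a function of the k-vector beta."""
    resid = y - X @ beta
    return resid @ resid

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# First-order conditions: the gradient of SSR is -2 X'(y - X beta),
# which should vanish at beta_hat.
gradient_at_beta_hat = -2.0 * X.T @ (y - X @ beta_hat)

# Moving away from beta_hat in random directions should never lower the SSR.
ssr_min = ssr(beta_hat)
perturbed = [ssr(beta_hat + 0.1 * rng.normal(size=k)) for _ in range(100)]
never_lower = all(value >= ssr_min for value in perturbed)
```

Setting the gradient to zero reproduces the moment conditions obtained from
the method of moments, which is why the two estimators coincide.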


Final Remarks


We have seen that it is perfectly easy to obtain an algebraic expression, (1.46),
for the OLS estimator $\hat{\boldsymbol{\beta}}$. With modern computers and appropriate software,
it is also easy to obtain OLS estimates numerically, even for regressions with 
millions of observations and dozens of explanatory variables; the time-honored 
term for doing so is “running a regression”. What is not so easy, and will 
occupy us for most of the next four chapters, is to understand the properties 
of these estimates. 


We will be concerned with two types of properties. The first type, numerical 
properties, arise as a consequence of the way that OLS estimates are obtained. 
These properties hold for every set of OLS estimates, no matter how the data 
were generated. That they hold for any data set can easily be verified by direct 
calculation. The numerical properties of OLS will be discussed in Chapter 2. 
The second type, statistical properties, depend on the way in which the data 
were generated. They can be verified theoretically, under certain assumptions, 
and they can be illustrated by simulation, but we can never prove that they 
are true for any given data set. The statistical properties of OLS will be 
discussed in detail in Chapters 3, 4, and 5. 


Readers who seek a deeper treatment of the topics dealt with in the first two 
sections may wish to consult Gallant (1997) or Mittelhammer (1996). 


1.6 Notes on the Exercises 


Each chapter of this book is followed by a set of exercises. These exercises are 
of various sorts, and they have various intended functions. Some are, quite 
simply, just for practice. Some serve chiefly to extend the material presented 
in the chapter. In many cases, the new material in such exercises recurs 
later in the book, and it is hoped that readers who have worked through 
them will follow later discussions more easily. A case in point concerns the 
bootstrap. Some of the exercises in this chapter and the next two are designed 
to familiarize readers with the tools that are used to implement the bootstrap, 
so that, when it is introduced formally in Chapter 4, the bootstrap will appear 
as a natural development. Other exercises have a tidying-up function. Details 
left out of the discussions in the main text are taken up, and conscientious 
readers can check that unproved claims made in the text are in fact justified. 


Many of the exercises require the reader to make use of a computer, sometimes 
to compute estimates and test statistics using real or simulated data, and 
sometimes for the purpose of doing simulations. There are a great many 
computer packages that are capable of doing the things we ask for in the 
exercises, and it seems unnecessary to make any specific recommendations as 
to what software would be best. Besides, we expect that many readers will 
already have developed their own personal preferences for software packages, 
and we know better than to try to upset such preferences. 


Copyright © 1999, Russell Davidson and James G. MacKinnon 




Some exercises require, not only a computer, but also actual economic data. 
It cannot be stressed enough that econometrics is an empirical discipline, and 
that the analysis of economic data is its raison d'être. All of the data needed
for the exercises are available from the World Wide Web site for this book.
The address is


http://www.econ.queensu.ca/ETM/


This web site will ultimately contain corrections and updates to the book as 
well as the data needed for the exercises. 


1.7 Exercises 


1.1 Consider a sample of $n$ observations, $y_1, y_2, \ldots, y_n$, on some random
variable $Y$. The empirical distribution function, or EDF, of this sample is a
discrete distribution with $n$ possible points. These points are just the $n$ observed
points, $y_1, y_2, \ldots, y_n$. Each point is assigned the same probability, which is
just $1/n$, in order to ensure that all the probabilities sum to 1.


Compute the expectation of the discrete distribution characterized by the
EDF, and show that it is equal to the sample mean, that is, the unweighted
average of the $n$ sample points, $y_1, y_2, \ldots, y_n$.


1.2 A random variable computed as the ratio of two independent standard normal
variables follows what is called the Cauchy distribution. It can be shown that
the density of this distribution is


$$f(x) = \frac{1}{\pi(1 + x^2)}.$$


Show that the Cauchy distribution has no first moment, which means that its
expectation does not exist.


Use your favorite random number generator to generate samples of 10, 100, 
1,000, and 10,000 drawings from the Cauchy distribution, and as many in- 
termediate values of n as you have patience or computer time for. For each 
sample, compute the sample mean. Do these sample means seem to converge 
to zero as the sample size increases? Repeat the exercise with drawings from 
the standard normal density. Do these sample means tend to converge to zero 
as the sample size increases? 


1.3 Consider two events $A$ and $B$ such that $A \subset B$. Compute $\Pr(A \mid B)$ in terms
of $\Pr(A)$ and $\Pr(B)$. Interpret the result.


1.4 Prove Bayes' Theorem. This famous theorem states that, for any two events
$A$ and $B$ with nonzero probabilities,


$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}.$$


Another form of the theorem deals with two continuous random variables $X_1$
and $X_2$, which have a joint density $f(x_1, x_2)$. Show that, for any values $x_1$
and $x_2$ that are permissible for $X_1$ and $X_2$, respectively,


$$f(x_1 \mid x_2) = \frac{f(x_2 \mid x_1)\,f(x_1)}{f(x_2)}.$$


Suppose that $X$ and $Y$ are two binary random variables. Their joint
distribution is given in the following table.


What is the marginal distribution of $Y$? What is the distribution of $Y$
conditional on $X = 0$? What is the distribution of $Y$ conditional on $X = 1$?


Demonstrate the Law of Iterated Expectations explicitly by showing that
$E(E(X \mid Y)) = E(X)$. Let $h(Y) = Y^3$. Show explicitly that $E(X h(Y) \mid Y) =
h(Y)\,E(X \mid Y)$ in this case.


1.7 Using expression (1.06) for the density $\phi(x)$ of the standard normal
distribution, show that the derivative of $\phi(x)$ is the function $-x\phi(x)$, and that the
second derivative is $(x^2 - 1)\phi(x)$. Use these facts to show that the expectation
of a standard normal random variable is 0, and that its variance is 1. These
two properties account for the use of the term “standard.”


1.8 A normally distributed random variable can have any mean $\mu$ and any positive
variance $\sigma^2$. Such a random variable is said to follow the $N(\mu, \sigma^2)$ distribution.
A standard normal variable therefore has the $N(0, 1)$ distribution. Suppose
that $X$ has the standard normal distribution. Show that the random variable
$Z \equiv \mu + \sigma X$ has mean $\mu$ and variance $\sigma^2$.


Compute the CDF of the $N(\mu, \sigma^2)$ distribution in terms of $\Phi(\cdot)$, the CDF of
the standard normal distribution. Differentiate your answer so as to obtain
the PDF of $N(\mu, \sigma^2)$.


1.9 If two random variables $X_1$ and $X_2$ are statistically independent, show that
$E(X_1 \mid X_2) = E(X_1)$.


1.10 The covariance of two random variables $X_1$ and $X_2$, which is often written
as $\mathrm{Cov}(X_1, X_2)$, is defined as the expectation of the product of $X_1 - E(X_1)$
and $X_2 - E(X_2)$. Consider a random variable $X_1$ with mean zero. Show that
the covariance of $X_1$ and any other random variable $X_2$, whether it has mean
zero or not, is just the expectation of the product of $X_1$ and $X_2$.


Show that the covariance of the random variables $E(X_1 \mid X_2)$ and $X_1 -
E(X_1 \mid X_2)$ is zero. It is easiest to show this result by first showing that
it is true when the covariance is computed conditional on $X_2$.


Show also that the variance of the random variable $X_1 - E(X_1 \mid X_2)$ cannot
be greater than the variance of $X_1$, and that the two variances will be equal
if $X_1$ and $X_2$ are independent. This result shows how one random variable
can be informative about another: Conditioning on it reduces variance unless
the two variables are independent.


1.11 Prove that, if $X_1$ and $X_2$ are statistically independent, $\mathrm{Cov}(X_1, X_2) = 0$.


1.12 Let a random variable $X_1$ be distributed as $N(0, 1)$. Now suppose that a
second random variable, $X_2$, is constructed as the product of $X_1$ and an
independent random variable $Z$, which equals 1 with probability 1/2 and $-1$
with probability 1/2.


What is the (marginal) distribution of $X_2$? What is the covariance between
$X_1$ and $X_2$? What is the distribution of $X_1$ conditional on $X_2$?


1.13 Consider the linear regression models


$$
\begin{aligned}
H_1:&\quad y_t = \beta_1 + \beta_2 X_t + u_t \quad\text{and}\\
H_2:&\quad \log y_t = \gamma_1 + \gamma_2 \log X_t + u_t.
\end{aligned}
$$


Suppose that the data are actually generated by $H_2$, with $\gamma_1 = 1.5$ and
$\gamma_2 = 0.5$, and that the value of $X_t$ varies from 10 to 110 with an average
value of 60. Ignore the error terms and consider the deterministic relations
between $y_t$ and $X_t$ implied by the two models. Find the values of $\beta_1$ and $\beta_2$
that make the relation given by $H_1$ have the same level and the same value
of $dy_t/dX_t$ as the level and value of $dy_t/dX_t$ implied by the relation given
by $H_2$ when it is evaluated at the average value of the regressor.


Using the deterministic relations, plot $y_t$ as a function of $X_t$ for both models
for $10 \le X_t \le 110$. Also plot $\log y_t$ as a function of $\log X_t$ for both models for
the same range of $X_t$. How well do the two models approximate each other
in each of the plots?


1.14 Consider two matrices $\mathbf{A}$ and $\mathbf{B}$ of dimensions such that the product $\mathbf{AB}$
exists. Show that the $i^{\text{th}}$ row of $\mathbf{AB}$ is the matrix product of the $i^{\text{th}}$ row of
$\mathbf{A}$ with the entire matrix $\mathbf{B}$. Show that this result implies that the $i^{\text{th}}$ row of
a product $\mathbf{ABC}\cdots$, with arbitrarily many factors, is the product of the $i^{\text{th}}$
row of $\mathbf{A}$ with $\mathbf{BC}\cdots$.


What is the corresponding result for the columns of $\mathbf{AB}$? What is the
corresponding result for the columns of $\mathbf{ABC}\cdots$?


1.15 Consider two invertible square matrices $\mathbf{A}$ and $\mathbf{B}$, of the same dimensions.
Show that the inverse of the product $\mathbf{AB}$ exists and is given by the formula


$$(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}.$$


This shows that there is a reversal rule for inverses as well as for transposes;
see (1.30).


1.16 Show that the transpose of the product of an arbitrary number of factors is
the product of the transposes of the individual factors in completely reversed
order:


$$(\mathbf{ABC}\cdots)^\top = \cdots\mathbf{C}^\top\mathbf{B}^\top\mathbf{A}^\top.$$


Show also that an analogous result holds for the inverse of the product of an
arbitrary number of factors.
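

1.17 Consider the following example of multiplying partitioned matrices:


$$
\begin{bmatrix}
\mathbf{A}_{11} & \mathbf{A}_{12}\\
\mathbf{A}_{21} & \mathbf{A}_{22}
\end{bmatrix}
\begin{bmatrix}
\mathbf{B}_{11} & \mathbf{B}_{12}\\
\mathbf{B}_{21} & \mathbf{B}_{22}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22}\\
\mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22}
\end{bmatrix}.
$$


Check all the expressions on the right-hand side, verifying that all products
are well defined and that all sums are of matrices of the same dimensions.


1.18 Suppose that $\mathbf{X} = [\boldsymbol{\iota} \;\; \mathbf{X}_1 \;\; \mathbf{X}_2]$, where $\mathbf{X}$ is $n \times k$, $\boldsymbol{\iota}$ is an $n$-vector of 1s,
$\mathbf{X}_1$ is $n \times k_1$, and $\mathbf{X}_2$ is $n \times k_2$. What is the matrix $\mathbf{X}^\top\mathbf{X}$ in terms of
the components of $\mathbf{X}$? What are the dimensions of its component matrices?
What is the element in the upper left-hand corner of $\mathbf{X}^\top\mathbf{X}$ equal to?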


1.19 Fix a sample size of $n = 100$, and simulate the very simplest regression model,
namely, $y_t = \beta + u_t$. Set $\beta = 1$, and let the error terms $u_t$ be drawings from
the standard normal distribution. Compute the sample mean of the $y_t$,


$$\bar{y} \equiv \frac{1}{n}\sum_{t=1}^{n} y_t.$$


Use your favorite econometrics software package to run a regression with $\mathbf{y}$,
the $100 \times 1$ vector with typical element $y_t$, as the dependent variable, and a
constant as the sole explanatory variable. Show that the OLS estimate of the
constant is equal to the sample mean. Why is this a necessary consequence
of the formula (1.46)?


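1.20 For the multiple linear regression model (1.44), the sum of squared residuals
can be written as


$$\mathrm{SSR}(\boldsymbol{\beta}) = \sum_{t=1}^{n} (y_t - \mathbf{X}_t\boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$


Show that, if we minimize $\mathrm{SSR}(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$, the minimizing value of
$\boldsymbol{\beta}$ is $\hat{\boldsymbol{\beta}}$, the OLS estimator given by (1.46). The easiest way is to show that
the first-order conditions for a minimum are exactly the equations (1.47),
or (1.48), that arise from MM estimation. This can be done without using
matrix calculus.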


1.21 The file consumption.data contains data on real personal disposable income
and consumption expenditures in Canada, seasonally adjusted in 1986 dol- 
lars, from the first quarter of 1947 until the last quarter of 1996. The sim- 
plest imaginable model of the Canadian consumption function would have 
consumption expenditures as the dependent variable, and a constant and 
personal disposable income as explanatory variables. Run this regression for 
the period 1953:1 to 1996:4. What is your estimate of the marginal propensity 
to consume out of disposable income? 


Plot a graph of the OLS residuals for the consumption function regression 
against time. All modern regression packages will generate these residuals for 
you on request. Does the appearance of the residuals suggest that this model 
of the consumption function is well specified? 


Simulate the consumption function model you have just estimated in exercise 
1.21 for the same sample period, using the actual data on disposable income. 
For the parameters, use the OLS estimates obtained in exercise 1.21. For 
the error terms, use drawings from the $N(0, s^2)$ distribution, where $s^2$ is the
estimate of the error variance produced by the regression package. 


Next, run a regression using the simulated consumption data as the dependent 
variable and the constant and disposable income as explanatory variables. Are 
the parameter estimates the same as those obtained using the real data? Why 
or why not? 


Plot the residuals from the regression with simulated data. Does the plot look 
substantially different from the one obtained using the real data? It should! 


The Geometry of Linear Regression


2.1 Introduction 


In Chapter 1, we introduced regression models, both linear and nonlinear, 
and discussed how to estimate linear regression models by using the method 
of moments. We saw that all n observations of a linear regression model with 
k regressors can be written as 


$$y = X\beta + u, \tag{2.01}$$


where $y$ and $u$ are n-vectors, $X$ is an $n \times k$ matrix, one column of which may
be a constant term, and $\beta$ is a k-vector. We also saw that the MM estimates,
usually called the ordinary least squares or OLS estimates, of the vector $\beta$ are


$$\hat\beta = (X^\top X)^{-1} X^\top y. \tag{2.02}$$


In this chapter, we will be concerned with the numerical properties of these 
OLS estimates. We refer to certain properties of estimates as “numerical” if 
they have nothing to do with how the data were actually generated. Such 
properties hold for every set of data by virtue of the way in which $\hat\beta$ is
computed, and the fact that they hold can always be verified by direct calculation.
In contrast, the statistical properties of OLS estimates, which will be discussed 
in Chapter 3, necessarily depend on unverifiable assumptions about how the 
data were generated, and they can never be verified for any actual data set. 


In order to understand the numerical properties of OLS estimates, it is useful 
to look at them from the perspective of Euclidean geometry. This geometrical 
interpretation is remarkably simple. Essentially, it involves using Pythagoras’ 
Theorem and a little bit of high-school trigonometry in the context of fi- 
nite-dimensional vector spaces. Although this approach is simple, it is very 
powerful. Once one has a thorough grasp of the geometry involved in ordi- 
nary least squares, one can often save oneself many tedious lines of algebra 
by a simple geometrical argument. We will encounter many examples of this 
throughout the book. 


In the next section, we review some relatively elementary material on the 
geometry of vector spaces and Pythagoras’ Theorem. In Section 2.3, we then 
discuss the most important numerical properties of OLS estimation from a 
geometrical perspective. In Section 2.4, we introduce an extremely useful
result called the FWL Theorem, and in Section 2.5 we present a number of 
applications of this theorem. Finally, in Section 2.6, we discuss how and to 
what extent individual observations influence parameter estimates. 


2.2 The Geometry of Vector Spaces 


In Section 1.4, an n-vector was defined as a column vector with n elements, 
that is, an n x 1 matrix. The elements of such a vector are real numbers. 
The usual notation for the real line is $\mathbb{R}$, and it is therefore natural to denote
the set of n-vectors as $\mathbb{R}^n$. However, in order to use the insights of Euclidean
geometry to enhance our understanding of the algebra of vectors and matrices,
it is desirable to introduce the notion of a Euclidean space in n dimensions,
which we will denote as $E^n$. The difference between $\mathbb{R}^n$ and $E^n$ is not that they
consist of different sorts of vectors, but rather that a wider set of operations
is defined on $E^n$. A shorthand way of saying that a vector $x$ belongs to an
n-dimensional Euclidean space is to write $x \in E^n$.


Addition and subtraction of vectors in $E^n$ is no different from the addition
and subtraction of $n \times 1$ matrices discussed in Section 1.4. The same thing is
true of multiplication by a scalar in $E^n$. The final operation essential to $E^n$
is that of the scalar or inner product. For any two vectors $x, y \in E^n$, their
scalar product is

$$\langle x, y \rangle = x^\top y.$$


The notation on the left is generally used in the context of the geometry of
vectors, while the notation on the right is generally used in the context of
matrix algebra. Note that $\langle x, y \rangle = \langle y, x \rangle$, since $x^\top y = y^\top x$. Thus the scalar
product is commutative.


The scalar product is what allows us to make a close connection between
n-vectors considered as matrices and considered as geometrical objects. It
allows us to define the length of any vector in $E^n$. The length, or norm, of a
vector $x$ is simply

$$\|x\| = (x^\top x)^{1/2}.$$

This is just the square root of the inner product of $x$ with itself. In scalar
terms, it is

$$\|x\| = \Bigl( \sum_{i=1}^{n} x_i^2 \Bigr)^{1/2}. \tag{2.03}$$
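
The two definitions just given translate almost word for word into code. The following Python sketch uses only the standard library; the function names `scalar_product` and `norm` and the numerical vectors are hypothetical choices made for illustration.

```python
import math


def scalar_product(x, y):
    """Return <x, y> = x'y for two vectors with the same number of elements."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of elements")
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    """Return ||x|| = (x'x)^(1/2), as in (2.03)."""
    return math.sqrt(scalar_product(x, x))


x = [3.0, 4.0]
y = [1.0, 2.0]

print(scalar_product(x, y))  # 3*1 + 4*2
print(scalar_product(y, x))  # same value: the scalar product is commutative
print(norm(x))               # square root of 3^2 + 4^2
```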


Pythagoras’ Theorem 


The definition (2.03) is inspired by the celebrated theorem of Pythagoras,
which says that the square on the longest side of a right-angled triangle is
equal to the sum of the squares on the other two sides. This longest side
is called the hypotenuse. Pythagoras’ Theorem is illustrated in Figure 2.1.
The figure shows a right-angled triangle, ABC, with hypotenuse AC, and two
other sides, AB and BC, of lengths $x_1$ and $x_2$ respectively. The squares on
each of the three sides of the triangle are drawn, and the area of the square
on the hypotenuse is shown as $x_1^2 + x_2^2$, in accordance with the theorem.


Figure 2.1 Pythagoras’ Theorem


A beautiful proof of Pythagoras’ Theorem, not often found in geometry texts,
is shown in Figure 2.2. Two squares of equal area are drawn. Each square
contains four copies of the same right-angled triangle. The square on the left
also contains the squares on the two shorter sides of the triangle, while the
square on the right contains the square on the hypotenuse. The theorem
follows at once.


Figure 2.2 Proof of Pythagoras’ Theorem


Figure 2.3 A vector $x$ in $E^2$


Any vector $x \in E^2$ has two components, usually denoted as $x_1$ and $x_2$. These
two components can be interpreted as the Cartesian coordinates of the
vector in the plane. The situation is illustrated in Figure 2.3. With O as the
origin of the coordinates, a right-angled triangle is formed by the lines OA,
AB, and OB. The length of the horizontal side of the triangle, OA, is the
horizontal coordinate $x_1$. The length of the vertical side, AB, is the vertical
coordinate $x_2$. Thus the point B has Cartesian coordinates $(x_1, x_2)$. The
vector $x$ itself is usually represented as the hypotenuse of the triangle, OB, that
is, the directed line (depicted as an arrow) joining the origin to the point B,
with coordinates $(x_1, x_2)$. By Pythagoras’ Theorem, the length of the vector
$x$, the hypotenuse of the triangle, is $(x_1^2 + x_2^2)^{1/2}$. This is what (2.03) becomes
for the special case n = 2.


Vector Geometry in Two Dimensions 


Let $x$ and $y$ be two vectors in $E^2$, with components $(x_1, x_2)$ and $(y_1, y_2)$,
respectively. Then, by the rules of matrix addition, the components of $x + y$
are $(x_1 + y_1, x_2 + y_2)$. Figure 2.4 shows how the addition of $x$ and $y$ can
be performed geometrically in two different ways. The vector $x$ is drawn as
the directed line segment, or arrow, from the origin O to the point A with
coordinates $(x_1, x_2)$. The vector $y$ can be drawn similarly and represented
by the arrow OB. However, we could also draw $y$ starting, not at O, but at
the point reached after drawing $x$, namely A. The arrow AC has the same
length and direction as OB, and we will see in general that arrows with the
same length and direction can be taken to represent the same vector. It is
clear by construction that the coordinates of C are $(x_1 + y_1, x_2 + y_2)$, that is,
the coordinates of $x + y$. Thus the sum $x + y$ is represented geometrically by
the arrow OC.


Figure 2.4 Addition of vectors


The classical way of adding vectors geometrically is to form a parallelogram 
using the line segments OA and OB that represent the two vectors as adjacent 
sides of the parallelogram. The sum of the two vectors is then the diagonal 
through O of the resulting parallelogram. It is easy to see that this classical 
method also gives the result that the sum of the two vectors is represented 
by the arrow OC, since the figure OACB is just the parallelogram required 
by the construction, and OC is its diagonal through O. The parallelogram 
construction also shows clearly that vector addition is commutative, since 
y + x is represented by OB, for y, followed by BC, for x. The end result is 
once more OC. 


Multiplying a vector by a scalar is also very easy to represent geometrically.
If a vector $x$ with components $(x_1, x_2)$ is multiplied by a scalar $\alpha$, then $\alpha x$
has components $(\alpha x_1, \alpha x_2)$. This is depicted in Figure 2.5, where $\alpha = 2$. The
line segments OA and OB represent $x$ and $\alpha x$, respectively. It is clear that
even if we move $\alpha x$ so that it starts somewhere other than O, as with CD
in the figure, the vectors $x$ and $\alpha x$ are always parallel. If $\alpha$ were negative,
then $\alpha x$ would simply point in the opposite direction. Thus, for $\alpha = -2$, $\alpha x$
would be represented by DC, rather than CD.


Another property of multiplication by a scalar is clear from Figure 2.5. By
direct calculation,

$$\|\alpha x\| = \langle \alpha x, \alpha x \rangle^{1/2} = |\alpha| (x^\top x)^{1/2} = |\alpha| \, \|x\|. \tag{2.04}$$

Since $\alpha = 2$, OB and CD in the figure are twice as long as OA.


Figure 2.5 Multiplication by a scalar


The Geometry of Scalar Products 


The scalar product of two vectors $x$ and $y$, whether in $E^2$ or $E^n$, can be
expressed geometrically in terms of the lengths of the two vectors and the
angle between them, and this result will turn out to be very useful. In the
case of $E^2$, it is natural to think of the angle between two vectors as the angle
between the two line segments that represent them. As we will now show, it
is also quite easy to define the angle between two vectors in $E^n$.


If the angle between two vectors is 0, they must be parallel. The vector $y$ is
parallel to the vector $x$ if $y = \alpha x$ for some suitable $\alpha$. In that event,

$$\langle x, y \rangle = \langle x, \alpha x \rangle = \alpha x^\top x = \alpha \|x\|^2.$$

From (2.04), we know that $\|y\| = |\alpha| \, \|x\|$, and so, if $\alpha > 0$, it follows that

$$\langle x, y \rangle = \|x\| \, \|y\|. \tag{2.05}$$


Of course, this result is true only if x and y are parallel and point in the same 
direction (rather than in opposite directions). 


For simplicity, consider initially two vectors, $w$ and $z$, both of length 1, and
let $\theta$ denote the angle between them. This is illustrated in Figure 2.6. Suppose
that the first vector, $w$, has coordinates (1, 0). It is therefore represented by
a horizontal line of length 1 in the figure. Suppose that the second vector, $z$,
is also of length 1, that is, $\|z\| = 1$. Then, by elementary trigonometry, the
coordinates of $z$ must be $(\cos\theta, \sin\theta)$. To show this, note first that, if so,

$$\|z\|^2 = \cos^2\theta + \sin^2\theta = 1, \tag{2.06}$$

as required. Next, consider the right-angled triangle OAB, in which the
hypotenuse OB represents $z$ and is of length 1, by (2.06). The length of the
side AB opposite O is $\sin\theta$, the vertical coordinate of $z$. Then the sine of
the angle BOA is given, by the usual trigonometric rule, by the ratio of the
length of the opposite side AB to that of the hypotenuse OB. This ratio is
$\sin\theta / 1 = \sin\theta$, and so the angle BOA is indeed equal to $\theta$.


Figure 2.6 The angle between two vectors


Now let us compute the scalar product of $w$ and $z$. It is

$$\langle w, z \rangle = w^\top z = w_1 z_1 + w_2 z_2 = z_1 = \cos\theta,$$

because $w_1 = 1$ and $w_2 = 0$. This result holds for vectors $w$ and $z$ of length 1.
More generally, let $x = \alpha w$ and $y = \gamma z$, for positive scalars $\alpha$ and $\gamma$. Then
$\|x\| = \alpha$ and $\|y\| = \gamma$. Thus we have

$$\langle x, y \rangle = x^\top y = \alpha\gamma \, w^\top z = \alpha\gamma \langle w, z \rangle.$$

Because $x$ is parallel to $w$, and $y$ is parallel to $z$, the angle between $x$ and $y$
is the same as that between $w$ and $z$, namely $\theta$. Therefore,

$$\langle x, y \rangle = \|x\| \, \|y\| \cos\theta. \tag{2.07}$$


This is the general expression, in geometrical terms, for the scalar product of
two vectors. It is true in $E^n$ just as it is in $E^2$, although we have not proved
this. In fact, we have not quite proved (2.07) even for the two-dimensional
case, because we made the simplifying assumption that the direction of $x$
and $w$ is horizontal. In Exercise 2.1, we ask the reader to provide a more
complete proof.


The cosine of the angle between two vectors provides a natural way to measure
how close two vectors are in terms of their directions. Recall that $\cos\theta$ varies
between $-1$ and 1; if we measure angles in radians, $\cos 0 = 1$, $\cos(\pi/2) = 0$,
and $\cos\pi = -1$. Thus $\cos\theta$ will be 1 for vectors that are parallel and point in
the same direction, 0 for vectors that are at right angles to each other, and $-1$
for vectors that point in directly
opposite directions. If the angle $\theta$ between the vectors $x$ and $y$ is a right angle,
its cosine is 0, and so, from (2.07), the scalar product $\langle x, y \rangle$ is 0. Conversely,
if $\langle x, y \rangle = 0$, then $\cos\theta = 0$ unless $x$ or $y$ is a zero vector. If $\cos\theta = 0$, it
follows that $\theta = \pi/2$. Thus, if two nonzero vectors have a zero scalar product,
they are at right angles. Such vectors are often said to be orthogonal, or,
less commonly, perpendicular. This definition implies that the zero vector is
orthogonal to everything.


Since the cosine function can take on values only between $-1$ and 1, a
consequence of (2.07) is that

$$|x^\top y| \le \|x\| \, \|y\|. \tag{2.08}$$

This result, which is called the Cauchy-Schwarz inequality, says that the
inner product of $x$ and $y$ can never be greater than the length of the vector $x$
times the length of the vector $y$. Only if $x$ and $y$ are parallel does the
inequality in (2.08) become the equality (2.05). Readers are asked to prove
this result in Exercise 2.2.
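
To make (2.07) and (2.08) concrete, the following Python sketch computes the cosine of the angle between a few pairs of vectors and checks the inequality for each pair. It uses only the standard library. The helper `cos_angle` and the particular vectors are hypothetical, chosen so that one pair is orthogonal and another is parallel.

```python
import math


def scalar_product(x, y):
    """Return <x, y> = x'y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    """Return ||x|| = (x'x)^(1/2)."""
    return math.sqrt(scalar_product(x, x))


def cos_angle(x, y):
    """Cosine of the angle between nonzero vectors x and y, from (2.07)."""
    return scalar_product(x, y) / (norm(x) * norm(y))


x = [1.0, 2.0, 2.0]
y = [2.0, 0.0, -1.0]   # <x, y> = 2 + 0 - 2 = 0, so x and y are orthogonal
z = [2.0, 4.0, 4.0]    # z = 2x, so z is parallel to x
v = [-1.0, 0.0, 3.0]   # neither parallel nor orthogonal to x

for a, b in [(x, y), (x, z), (x, v)]:
    lhs = abs(scalar_product(a, b))   # |x'y|
    rhs = norm(a) * norm(b)           # ||x|| ||y||
    print(cos_angle(a, b), lhs <= rhs)
```

For the parallel pair the two sides of (2.08) are equal; for the other pairs the inequality is strict.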


Subspaces of Euclidean Space 


For arbitrary positive integers n, the elements of an n-vector can be thought
of as the coordinates of a point in $E^n$. In particular, in the regression model
(2.01), the regressand $y$ and each column of the matrix of regressors $X$ can be
thought of as vectors in $E^n$. This makes it possible to represent a relationship
like (2.01) geometrically.


It is obviously impossible to represent all n dimensions of $E^n$ physically
when n > 3. For the pages of a book, even three dimensions can be too many,
although a proper use of perspective drawings can allow three dimensions to
be shown. Fortunately, we can represent (2.01) without needing to draw in
n dimensions. The key to this is that there are only three vectors in (2.01):
$y$, $X\beta$, and $u$. Since only two vectors, $X\beta$ and $u$, appear on the right-hand
side of (2.01), only two dimensions are needed to represent it. Because $y$ is
equal to $X\beta + u$, these two dimensions suffice for $y$ as well.


To see how this works, we need the concept of a subspace of a Euclidean
space $E^n$. Normally, such a subspace will have a dimension lower than n. The
easiest way to define a subspace of $E^n$ is in terms of a set of basis vectors. A
subspace that is of particular interest to us is the one for which the columns
of $X$ provide the basis vectors. We may denote the k columns of $X$ as $x_1$,
$x_2$, ..., $x_k$. Then the subspace associated with these k basis vectors will be
denoted by $\mathcal{S}(X)$ or $\mathcal{S}(x_1, \ldots, x_k)$. The basis vectors are said to span this
subspace, which will in general be a k-dimensional subspace.


The subspace $\mathcal{S}(x_1, \ldots, x_k)$ consists of every vector that can be formed as a
linear combination of the $x_i$, $i = 1, \ldots, k$. Formally, it is defined as

$$\mathcal{S}(x_1, \ldots, x_k) = \Bigl\{ z \in E^n \;\Big|\; z = \sum_{i=1}^{k} b_i x_i, \; b_i \in \mathbb{R} \Bigr\}. \tag{2.09}$$


Figure 2.7 The spaces $\mathcal{S}(X)$ and $\mathcal{S}^\perp(X)$


The subspace defined in (2.09) is called the subspace spanned by the $x_i$,
$i = 1, \ldots, k$, or the column space of $X$; less formally, it may simply be referred
to as the span of $X$, or the span of the $x_i$.


The orthogonal complement of $\mathcal{S}(X)$ in $E^n$, which is denoted $\mathcal{S}^\perp(X)$, is the
set of all vectors $w$ in $E^n$ that are orthogonal to everything in $\mathcal{S}(X)$. This
means that, for every $z$ in $\mathcal{S}(X)$, $\langle w, z \rangle = w^\top z = 0$. Formally,

$$\mathcal{S}^\perp(X) = \{ w \in E^n \mid w^\top z = 0 \text{ for all } z \in \mathcal{S}(X) \}.$$


If the dimension of $\mathcal{S}(X)$ is k, then the dimension of $\mathcal{S}^\perp(X)$ is n − k.
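
One way to see the dimension count numerically is to construct a basis for the orthogonal complement. The sketch below assumes NumPy is available and uses the singular value decomposition, which is a computational device chosen here for convenience and not a technique introduced in the text. The dimensions and the random matrix `X` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, k = 6, 2
X = rng.normal(size=(n, k))     # columns assumed to be linearly independent

# Full SVD: when X has rank k, the first k columns of U span S(X), and the
# remaining n - k columns are orthonormal vectors orthogonal to S(X).
U, s, Vt = np.linalg.svd(X, full_matrices=True)
W = U[:, k:]                    # basis vectors for the orthogonal complement

print(W.shape)                  # n rows and n - k columns
print(np.abs(X.T @ W).max())    # numerically zero

# Any z in S(X) can be written as X b, so w'z = 0 for every basis vector w
b = rng.normal(size=k)
z = X @ b
print(np.abs(W.T @ z).max())    # numerically zero
```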


Figure 2.7 illustrates the concepts of a subspace and its orthogonal
complement for the simplest case, in which n = 2 and k = 1. The matrix $X$ has
only one column in this case, and it is therefore represented in the figure by a
single vector, denoted $x$. As a consequence, $\mathcal{S}(X)$ is 1-dimensional, and, since
n = 2, $\mathcal{S}^\perp(X)$ is also 1-dimensional. Notice that $\mathcal{S}(X)$ and $\mathcal{S}^\perp(X)$ would be
the same if $x$ were any vector, except for the origin, parallel to the straight
line that represents $\mathcal{S}(X)$.


Now let us return to $E^n$. Suppose, to begin with, that k = 2. We have two
vectors, $x_1$ and $x_2$, which span a subspace of, at most, two dimensions. It
is always possible to represent vectors in a 2-dimensional space on a piece of
paper, whether that space is $E^2$ itself or, as in this case, the 2-dimensional
subspace of $E^n$ spanned by the vectors $x_1$ and $x_2$. To represent the first
vector, $x_1$, we choose an origin and a direction, both of which are entirely
arbitrary, and draw an arrow of length $\|x_1\|$ in that direction. Suppose that
the origin is the point O in Figure 2.8, and that the direction is the horizontal
direction in the plane of the page. Then an arrow to represent $x_1$ can be
drawn as shown in the figure. For $x_2$, we compute its length, $\|x_2\|$, and the
angle, $\theta$, that it makes with $x_1$. Suppose for now that $\theta \ne 0$. Then we choose
as our second dimension the vertical direction in the plane of the page, with
the result that we can draw an arrow for $x_2$, as shown.


Figure 2.8 A 2-dimensional subspace


Any vector in $\mathcal{S}(x_1, x_2)$ can be drawn in the plane of Figure 2.8. Consider,
for instance, the linear combination of $x_1$ and $x_2$ given by the expression
$z = b_1 x_1 + b_2 x_2$. We could draw the vector $z$ by computing its length and
the angle that it makes with $x_1$. Alternatively, we could apply the rules for
adding vectors geometrically that were illustrated in Figure 2.4 to the vectors
$b_1 x_1$ and $b_2 x_2$. This is illustrated in the figure for the case in which $b_1 = 2/3$
and $b_2 = 1/2$.


In precisely the same way, we can represent any three vectors by arrows in 
3-dimensional space, but we leave this task to the reader. It will be easier to 
appreciate the renderings of vectors in three dimensions in perspective that 
appear later on if one has already tried to draw 3-dimensional pictures, or 
even to model relationships in three dimensions with the help of a computer. 


We can finally represent the regression model (2.01) geometrically. This is
done in Figure 2.9. The horizontal direction is chosen for the vector $X\beta$, and
then the other two vectors $y$ and $u$ are shown in the plane of the page. It
is clear that, by construction, $y = X\beta + u$. Notice that $u$, the error vector,
is not orthogonal to $X\beta$. The figure contains no reference to any system of
axes, because there would be n of them, and we would not be able to avoid
needing n dimensions to treat them all.


Linear Independence 


In order to define the OLS estimator by the formula (1.46), it is necessary
to assume that the $k \times k$ square matrix $X^\top X$ is invertible, or nonsingular.
Equivalently, as we saw in Section 1.4, we may say that $X^\top X$ has full rank.
This condition is equivalent to the condition that the columns of $X$ should be
linearly independent. This is a very important concept for econometrics. Note
that the meaning of linear independence is quite different from the meaning
of statistical independence, which we discussed in Section 1.2. It is important
not to confuse these two concepts.


Figure 2.9 The geometry of the linear regression model


The vectors $x_1$ through $x_k$ are said to be linearly dependent if we can write
one of them as a linear combination of the others. In other words, there is a
vector $x_j$, $1 \le j \le k$, and coefficients $c_i$ such that

$$x_j = \sum_{i \ne j} c_i x_i. \tag{2.10}$$

Another, equivalent, definition is that there exist coefficients $b_i$, at least one
of which is nonzero, such that

$$\sum_{i=1}^{k} b_i x_i = 0. \tag{2.11}$$


Recall that 0 denotes the zero vector, every component of which is 0. It is
clear from the definition (2.11) that, if any of the $x_i$ is itself equal to the zero
vector, then the $x_i$ are linearly dependent. If $x_j = 0$, for example, then (2.11)
will be satisfied if we make $b_j$ nonzero and set $b_i = 0$ for all $i \ne j$.


If the vectors $x_i$, $i = 1, \ldots, k$, are the columns of an $n \times k$ matrix $X$, then
another way of writing (2.11) is

$$Xb = 0, \tag{2.12}$$

where $b$ is a k-vector with typical element $b_i$. In order to see that (2.11)
and (2.12) are equivalent, it is enough to check that the typical elements of
the two left-hand sides are the same; see Exercise 2.5. The set of vectors
$x_i$, $i = 1, \ldots, k$, is linearly independent if it is not linearly dependent, that
is, if there are no coefficients $c_i$ such that (2.10) is true, or (equivalently) no
coefficients $b_i$, not all zero, such that (2.11) is true, or (equivalently, once more)
no nonzero vector $b$ such that (2.12) is true.


It is easy to show that if the columns of $X$ are linearly dependent, the matrix
$X^\top X$ is not invertible. Premultiplying (2.12) by $X^\top$ yields

$$X^\top X b = 0. \tag{2.13}$$

Thus, if the columns of $X$ are linearly dependent, there is a nonzero k-vector
$b$ which is annihilated by $X^\top X$. The existence of such a vector $b$ means that
$X^\top X$ cannot be inverted. To see this, consider any vector $a$, and suppose
that

$$X^\top X a = c.$$

If $X^\top X$ could be inverted, then we could premultiply the above equation by
$(X^\top X)^{-1}$ to obtain

$$(X^\top X)^{-1} c = a. \tag{2.14}$$

However, (2.13) also allows us to write

$$X^\top X (a + b) = c,$$

which would give

$$(X^\top X)^{-1} c = a + b. \tag{2.15}$$

But (2.14) and (2.15) cannot both be true, because $b$ is nonzero, and so
$(X^\top X)^{-1}$ cannot exist.
Thus a necessary condition for the existence of $(X^\top X)^{-1}$ is that the columns
of $X$ should be linearly independent. With a little more work, it can be shown
that this condition is also sufficient, and so, if the regressors $x_1, \ldots, x_k$ are
linearly independent, $X^\top X$ is invertible.


If the k columns of $X$ are not linearly independent, then they will span a
subspace of dimension less than k, say $k'$, where $k'$ is the largest number of
columns of $X$ that are linearly independent of each other. The number $k'$ is
called the rank of $X$. Look again at Figure 2.8, and imagine that the angle $\theta$
between $x_1$ and $x_2$ tends to zero. If $\theta = 0$, then $x_1$ and $x_2$ are parallel, and we
can write $x_1 = \alpha x_2$, for some scalar $\alpha$. But this means that $x_1 - \alpha x_2 = 0$, and
so a relation of the form (2.11) holds between $x_1$ and $x_2$, which are therefore
linearly dependent. In the figure, if $x_1$ and $x_2$ are parallel, then only one
dimension is used, and there is no need for the second dimension in the plane
of the page. Thus, in this case, k = 2 and $k' = 1$.


When the dimension of $\mathcal{S}(X)$ is $k' < k$, $\mathcal{S}(X)$ will be identical to $\mathcal{S}(X')$, where
$X'$ is an $n \times k'$ matrix consisting of any $k'$ linearly independent columns of
$X$. For example, consider the following $X$ matrix, which is $5 \times 3$:

$$\begin{bmatrix} 1 & 0 & 1 \\ 1 & 4 & 0 \\ 1 & 0 & 1 \\ 1 & 4 & 0 \\ 1 & 0 & 1 \end{bmatrix}. \tag{2.16}$$

The columns of this matrix are not linearly independent, since

$$x_1 = 0.25\,x_2 + x_3.$$

However, any two of the columns are linearly independent, and so

$$\mathcal{S}(X) = \mathcal{S}(x_1, x_2) = \mathcal{S}(x_1, x_3) = \mathcal{S}(x_2, x_3);$$


see Exercise 2.8. For the remainder of this chapter, unless the contrary is 
explicitly assumed, we will assume that the columns of any regressor matrix 
X are linearly independent. 
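
The claims made about the matrix (2.16) are easy to check numerically. The sketch below assumes NumPy is available, and it takes the entries of (2.16) to be the ones consistent with the stated relation $x_1 = 0.25x_2 + x_3$. The names `rank_X`, `pair_ranks`, and `b` are hypothetical.

```python
import numpy as np

# The 5 x 3 matrix (2.16), entered row by row
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 4.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 4.0, 0.0],
              [1.0, 0.0, 1.0]])
x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]

# The linear dependence x1 = 0.25 x2 + x3
print(np.allclose(x1, 0.25 * x2 + x3))

# The rank of X, and the rank of each pair of columns
rank_X = np.linalg.matrix_rank(X)
pair_ranks = [np.linalg.matrix_rank(X[:, cols])
              for cols in ([0, 1], [0, 2], [1, 2])]
print(rank_X, pair_ranks)

# With b = (1, -0.25, -1), Xb = 0 as in (2.12), and so X'Xb = 0 as in (2.13)
b = np.array([1.0, -0.25, -1.0])
print(X.T @ X @ b)
```

The rank of $X$ is 2 rather than 3, every pair of columns has rank 2, and the nonzero vector $b$ is annihilated by $X^\top X$, which therefore cannot be inverted.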


2.3 The Geometry of OLS Estimation 


We studied the geometry of vector spaces in the last section because the nu- 
merical properties of OLS estimates are easily understood in terms of that 
geometry. The geometrical interpretation of OLS estimation, that is, MM es- 
timation of linear regression models, is simple and intuitive. In many cases, 
it entirely does away with the need for algebraic proofs. 


As we saw in the last section, any point in a subspace $\mathcal{S}(X)$, where $X$ is an
$n \times k$ matrix, can be represented as a linear combination of the columns of $X$.
We can partition $X$ in terms of its columns explicitly, as follows:

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix}.$$

In order to compute the matrix product $X\beta$ in terms of this partitioning, we
need to partition the vector $\beta$ by its rows. Since $\beta$ has only one column, the
elements of the partitioned vector are just the individual elements of $\beta$. Thus
we find that

$$X\beta = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} = x_1\beta_1 + x_2\beta_2 + \cdots + x_k\beta_k = \sum_{i=1}^{k} \beta_i x_i,$$

which is just a linear combination of the columns of $X$. In fact, it is clear
from the definition (2.09) that any linear combination of the columns of $X$,
and thus any element of the subspace $\mathcal{S}(X) = \mathcal{S}(x_1, \ldots, x_k)$, can be written
as $X\beta$ for some $\beta$. The specific linear combination (2.09) is constructed by
using $\beta = [b_1 \;\ldots\; b_k]^\top$. Thus every n-vector $X\beta$ belongs to $\mathcal{S}(X)$, which
is, in general, a k-dimensional subspace of $E^n$. In particular, the vector $X\hat\beta$
constructed using the OLS estimator $\hat\beta$ belongs to this subspace.
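
The identity between the matrix product $X\beta$ and the linear combination of the columns of $X$ can be checked directly. The sketch below assumes NumPy is available; the small matrix `X` and the vector `beta` contain arbitrary illustrative values.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 0.0],
              [1.0, -1.0]])       # n = 3, k = 2 (illustrative values)
beta = np.array([0.5, 3.0])

# The matrix product X beta
product = X @ beta

# The same vector built as a linear combination of the columns of X
combination = sum(beta[i] * X[:, i] for i in range(X.shape[1]))

print(product)
print(combination)
```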


The estimator $\hat\beta$ was obtained by solving the equations (1.48), which we
rewrite here for easy reference:

$$X^\top (y - X\hat\beta) = 0. \tag{1.48}$$

These equations have a simple geometrical interpretation. Note first that each
element of the left-hand side of (1.48) is a scalar product. By the rule for
selecting a single row of a matrix product (see Section 1.4), the $i$th element is

$$x_i^\top (y - X\hat\beta) = \langle x_i, \, y - X\hat\beta \rangle, \tag{2.17}$$

since $x_i$, the $i$th column of $X$, is the transpose of the $i$th row of $X^\top$. By (1.48),
the scalar product in (2.17) is zero, and so the vector $y - X\hat\beta$ is orthogonal to
all of the regressors, that is, all of the vectors $x_i$ that represent the explanatory
variables in the regression. For this reason, equations like (1.48) are often
referred to as orthogonality conditions.


Figure 2.10 Residuals and fitted values


Recall from Section 1.5 that the vector $y - X\beta$, treated as a function of $\beta$,
is called the vector of residuals. This vector may be written as $u(\beta)$. We
are interested in $u(\hat\beta)$, the vector of residuals evaluated at $\hat\beta$, which is often
called the vector of least squares residuals and is usually written simply as $\hat u$.
We have just seen, in (2.17), that $\hat u$ is orthogonal to all the regressors. This
implies that $\hat u$ is in fact orthogonal to every vector in $\mathcal{S}(X)$, the span of the
regressors. To see this, remember that any element of $\mathcal{S}(X)$ can be written
as $X\beta$ for some $\beta$, with the result that, by (1.48),

$$\langle X\beta, \hat u \rangle = (X\beta)^\top \hat u = \beta^\top X^\top \hat u = 0.$$

The vector $X\hat\beta$ is referred to as the vector of fitted values. Clearly, it lies
in $\mathcal{S}(X)$, and, consequently, it must be orthogonal to $\hat u$. Figure 2.10 is similar
to Figure 2.9, but it shows the vector of least squares residuals $\hat u$ and the
vector of fitted values $X\hat\beta$ instead of $u$ and $X\beta$. The key feature of this
figure, which is a consequence of the orthogonality conditions (1.48), is that
the vector $\hat u$ makes a right angle with the vector $X\hat\beta$.
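
Because orthogonality of the residuals is a numerical property, it can be verified with any data whatsoever. The sketch below assumes NumPy is available and uses pure noise for the regressand; the dimensions, the seed, and the names `fitted`, `u_hat`, and `beta_any` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)          # any y will do: the property is numerical

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat           # vector of fitted values
u_hat = y - fitted              # vector of least squares residuals

print(X.T @ u_hat)              # orthogonality conditions (1.48)
print(fitted @ u_hat)           # scalar product of fitted values and residuals

beta_any = rng.normal(size=k)   # an arbitrary parameter vector
print((X @ beta_any) @ u_hat)   # scalar product with an arbitrary vector in S(X)
```

All three printed quantities are zero up to rounding error, even though `y` was generated without any reference to `X`.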


Some things about the orthogonality conditions (1.48) are clearer if we add
a third dimension to the picture. Accordingly, in panel a) of Figure 2.11,
we consider the case of two regressors, $x_1$ and $x_2$, which together span the
horizontal plane labelled $\mathcal{S}(x_1, x_2)$, seen in perspective from slightly above
the plane. Although the perspective rendering of the figure does not make it
clear, both the lengths of $x_1$ and $x_2$ and the angle between them are totally
arbitrary, since they do not affect $\mathcal{S}(x_1, x_2)$ at all. The vector $y$ is intended
to be viewed as rising up out of the plane spanned by $x_1$ and $x_2$.


Figure 2.11 Linear regression in three dimensions: a) $y$ projected on two regressors; b) the span $\mathcal{S}(x_1, x_2)$ of the regressors; c) the vertical plane through $y$


In the 3-dimensional setup, it is clear that, if $\hat u$ is to be orthogonal to the
horizontal plane, it must itself be vertical. Thus it is obtained by “dropping
a perpendicular” from $y$ to the horizontal plane. The least-squares
interpretation of the MM estimator $\hat\beta$ can now be seen to be a consequence of
simple geometry. The shortest distance from $y$ to the horizontal plane is
obtained by descending vertically on to it, and the point in the horizontal
plane vertically below $y$, labeled A in the figure, is the closest point in the
plane to $y$. Thus $\|\hat u\|$ is the minimum value of $\|u(\beta)\|$, the norm of $u(\beta)$, with
respect to $\beta$.
The squared norm, $\|u(\beta)\|^2$, is just the sum of squared residuals, SSR($\beta$);
see (1.49). Since minimizing the norm of $u(\beta)$ is the same thing as
minimizing the squared norm, it follows that $\hat\beta$ is the OLS estimator.


Panel b) of the figure shows the horizontal plane $\mathcal{S}(x_1, x_2)$ as a
straightforward 2-dimensional picture, seen from directly above. The point A is the
point directly underneath $y$, and so, since $y = X\hat\beta + \hat u$ by definition, the
vector represented by the line segment OA is the vector of fitted values, $X\hat\beta$.
Geometrically, it is much simpler to represent $X\hat\beta$ than to represent just the
vector $\hat\beta$, because the latter lies in $\mathbb{R}^k$, a different space from the space $E^n$
that contains the variables and all linear combinations of them. However, it is
easy to see that the information in panel b) does indeed determine $\hat\beta$. Plainly,
$X\hat\beta$ can be decomposed in just one way as a linear combination of $x_1$ and $x_2$,
as shown. The numerical value of $\hat\beta_1$ can be computed as the ratio of the
length of the vector $\hat\beta_1 x_1$ to that of $x_1$, and similarly for $\hat\beta_2$.


In panel c) of Figure 2.11, we show the right-angled triangle that corresponds
to dropping a perpendicular from $y$, labelled in the same way as in panel a).
This triangle lies in the vertical plane that contains the vector $y$. We can see
that $y$ is the hypotenuse of the triangle, the other two sides being $X\hat\beta$ and $\hat u$.
Thus this panel corresponds to what we saw already in Figure 2.10. Since we
have a right-angled triangle, we can apply Pythagoras’ Theorem. It gives

$$\|y\|^2 = \|X\hat\beta\|^2 + \|\hat u\|^2. \tag{2.18}$$

If we write out the squared norms as scalar products, this becomes

$$y^\top y = \hat\beta^\top X^\top X \hat\beta + (y - X\hat\beta)^\top (y - X\hat\beta). \tag{2.19}$$


In words, the total sum of squares, or TSS, is equal to the explained sum 
of squares, or ESS, plus the sum of squared residuals, or SSR. This is a 
fundamental property of OLS estimates, and it will prove to be very useful in 
many contexts. Intuitively, it lets us break down the total variation (TSS) of 
the dependent variable into the explained variation (ESS) and the unexplained 
variation (SSR), unexplained because the residuals represent the aspects of y 
about which we remain in ignorance. 
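
The decomposition (2.18) can be confirmed numerically. The sketch below assumes NumPy is available and uses simulated data; the coefficient values and the names `tss`, `ess`, and `ssr` are assumptions for illustration. Note that, as in (2.19), the total sum of squares is computed here as $y^\top y$, without subtracting the mean of $y$.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat
u_hat = y - fitted

tss = y @ y                     # total sum of squares, ||y||^2
ess = fitted @ fitted           # explained sum of squares, ||X beta_hat||^2
ssr = u_hat @ u_hat             # sum of squared residuals, ||u_hat||^2

print(tss, ess + ssr)
```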


Orthogonal Projections 


When we estimate a linear regression model, we implicitly map the regressand
$y$ into a vector of fitted values $X\hat\beta$ and a vector of residuals $\hat u = y - X\hat\beta$.
Geometrically, these mappings are examples of orthogonal projections. A
projection is a mapping that takes each point of $E^n$ into a point in a subspace
of $E^n$, while leaving all points in that subspace unchanged. Because of this,
the subspace is called the invariant subspace of the projection. An orthogonal
projection maps any point into the point of the subspace that is closest to it.
If a point is already in the invariant subspace, it is mapped into itself.
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The concept of an orthogonal projection formalizes the notion of “dropping 
a perpendicular” that we used in the last subsection when discussing least 
squares. Algebraically, an orthogonal projection on to a given subspace can 
be performed by premultiplying the vector to be projected by a suitable pro- 
jection matrix. In the case of OLS, the two projection matrices that yield the 
vector of fitted values and the vector of residuals, respectively, are 


$$\begin{aligned} P_X &= X(X^\top X)^{-1}X^\top, \text{ and} \\ M_X &= I - P_X = I - X(X^\top X)^{-1}X^\top, \end{aligned} \tag{2.20}$$


where $I$ is the $n \times n$ identity matrix. To see this, recall (2.02), the formula
for the OLS estimates of $\beta$:

$$\hat\beta = (X^\top X)^{-1}X^\top y.$$

From this, we see that

$$X\hat\beta = X(X^\top X)^{-1}X^\top y = P_X y. \tag{2.21}$$


Therefore, the first projection matrix in (2.20), $P_X$, projects on to $\mathcal{S}(X)$. For
any $n$-vector $y$, $P_X y$ always lies in $\mathcal{S}(X)$, because

$$P_X y = X\bigl((X^\top X)^{-1}X^\top y\bigr).$$

Since this takes the form $Xb$ for $b = \hat\beta$, it is a linear combination of the
columns of $X$, and hence it belongs to $\mathcal{S}(X)$.


From (2.20), it is easy to show that $P_X X = X$. Since any vector in $\mathcal{S}(X)$
can be written as $Xb$ for some $b \in \mathbb{R}^k$, we see that

$$P_X Xb = Xb. \tag{2.22}$$

We saw from (2.21) that the result of acting on any vector $y \in E^n$ with $P_X$ is
a vector in $\mathcal{S}(X)$. Thus the invariant subspace of the projection $P_X$ must be
contained in $\mathcal{S}(X)$. But, by (2.22), every vector in $\mathcal{S}(X)$ is mapped into itself
by $P_X$. Therefore, the image of $P_X$, which is a shorter name for its invariant
subspace, is precisely $\mathcal{S}(X)$.

It is clear from (2.21) that, when $P_X$ is applied to $y$, it yields the vector of
fitted values. Similarly, when $M_X$, the second of the two projection matrices
in (2.20), is applied to $y$, it yields the vector of residuals:

$$M_X y = \bigl(I - X(X^\top X)^{-1}X^\top\bigr)y = y - P_X y = y - X\hat\beta = \hat u.$$

The image of $M_X$ is $\mathcal{S}^\perp(X)$, the orthogonal complement of the image of $P_X$.
To see this, consider any vector $w \in \mathcal{S}^\perp(X)$. It must satisfy the defining condition
$X^\top w = 0$. From the definition (2.20) of $P_X$, this implies that $P_X w = 0$,
the zero vector. Since $M_X = I - P_X$, we find that $M_X w = w$. Thus $\mathcal{S}^\perp(X)$
must be contained in the image of $M_X$. Next, consider any vector in the
image of $M_X$. It must take the form $M_X y$, where $y$ is some vector in $E^n$.
From this, it will follow that $M_X y$ belongs to $\mathcal{S}^\perp(X)$. Observe that


$$(M_X y)^\top X = y^\top M_X X, \tag{2.23}$$

an equality that relies on the symmetry of $M_X$. Then, from (2.20), we have

$$M_X X = (I - P_X)X = X - X = O, \tag{2.24}$$

where $O$ denotes a zero matrix, which in this case is $n \times k$. Taken together,
(2.23) and (2.24) say that any vector $M_X y$ in the image of $M_X$ is orthogonal to $X$, and thus
belongs to $\mathcal{S}^\perp(X)$. We saw above that $\mathcal{S}^\perp(X)$ was contained in the image
of $M_X$, and so this image must coincide with $\mathcal{S}^\perp(X)$. For obvious reasons,
the projection $M_X$ is sometimes called the projection off $\mathcal{S}(X)$.
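
The roles of the two matrices in (2.20) can be illustrated numerically. This is a small sketch, assuming NumPy and simulated data; the explicit $n \times n$ matrices are formed only because $n$ is tiny here.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n, k = 8, 2                                    # hypothetical dimensions
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T           # P_X from (2.20)
M = np.eye(n) - P                              # M_X from (2.20)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimates

fitted_ok = np.allclose(P @ y, X @ beta_hat)       # P_X y is the vector of fitted values
resid_ok = np.allclose(M @ y, y - X @ beta_hat)    # M_X y is the vector of residuals
invariant_ok = np.allclose(P @ X, X)               # P_X leaves S(X) unchanged
annihilated_ok = np.allclose(M @ X, 0.0)           # M_X X = O, as in (2.24)
print(fitted_ok, resid_ok, invariant_ok, annihilated_ok)
```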


For any matrix to represent a projection, it must be idempotent. An idempotent
matrix is one that, when multiplied by itself, yields itself again. Thus,

$$P_X P_X = P_X \quad\text{and}\quad M_X M_X = M_X.$$

These results are easily proved by a little algebra directly from (2.20), but the
geometry of the situation makes them obvious. If we take any point, project
it on to $\mathcal{S}(X)$, and then project it on to $\mathcal{S}(X)$ again, the second projection
can have no effect at all, because the point is already in $\mathcal{S}(X)$, and so it is
left unchanged. Since this implies that $P_X P_X y = P_X y$ for any vector $y$, it
must be the case that $P_X P_X = P_X$, and similarly for $M_X$.


Since, from (2.20),

$$P_X + M_X = I, \tag{2.25}$$

any vector $y \in E^n$ is equal to $P_X y + M_X y$. The pair of projections $P_X$ and
$M_X$ are said to be complementary projections, since the sum of $P_X y$ and
$M_X y$ restores the original vector $y$.


The fact that $\mathcal{S}(X)$ and $\mathcal{S}^\perp(X)$ are orthogonal subspaces leads us to say that
the two projection matrices $P_X$ and $M_X$ define what is called an orthogonal
decomposition of $E^n$, because the two vectors $M_X y$ and $P_X y$ lie in the two
orthogonal subspaces. Algebraically, the orthogonality depends on the fact
that $P_X$ and $M_X$ are symmetric matrices. To see this, we start from a
further important property of $P_X$ and $M_X$, which is that

$$P_X M_X = O. \tag{2.26}$$

This equation is true for any complementary pair of projections satisfying (2.25),
whether or not they are symmetric; see Exercise 2.9. We may say
that $P_X$ and $M_X$ annihilate each other. Now consider any vector $z \in \mathcal{S}(X)$
and any other vector $w \in \mathcal{S}^\perp(X)$. We have $z = P_X z$ and $w = M_X w$. Thus
the scalar product of the two vectors is

$$\langle P_X z, M_X w \rangle = z^\top P_X^\top M_X w.$$

Since $P_X$ is symmetric, $P_X^\top = P_X$, and so the above scalar product is zero
by (2.26). In general, however, if two complementary projection matrices are
not symmetric, the spaces they project on to are not orthogonal.


The projection matrix $M_X$ annihilates all points that lie in $\mathcal{S}(X)$, and $P_X$
likewise annihilates all points that lie in $\mathcal{S}^\perp(X)$. These properties can be
proved by straightforward algebra (see Exercise 2.11), but the geometry of
the situation is very simple. Consider Figure 2.7. It is evident that, if we
project any point in $\mathcal{S}^\perp(X)$ orthogonally on to $\mathcal{S}(X)$, we end up at the origin,
as we do if we project any point in $\mathcal{S}(X)$ orthogonally on to $\mathcal{S}^\perp(X)$.
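
The properties discussed above (idempotency, symmetry, and mutual annihilation) can be verified on simulated data. The sketch below assumes NumPy. The oblique projection `Q` is an illustration introduced here, not something constructed in the text: it is idempotent and complementary to `R = I - Q`, but it is not symmetric, so the two images are generally not orthogonal.

```python
import numpy as np

rng = np.random.default_rng(2)                  # arbitrary seed
n, k = 6, 2                                     # hypothetical dimensions
X = rng.normal(size=(n, k))
I = np.eye(n)

# Orthogonal projections from (2.20)
P = X @ np.linalg.solve(X.T @ X, X.T)
M = I - P
idempotent = np.allclose(P @ P, P) and np.allclose(M @ M, M)
symmetric = np.allclose(P, P.T) and np.allclose(M, M.T)
annihilate = np.allclose(P @ M, 0.0)            # (2.26)

# An oblique (non-symmetric) projection on to S(X); W is a hypothetical matrix
W = rng.normal(size=(n, k))
Q = X @ np.linalg.solve(W.T @ X, W.T)
R = I - Q                                       # complementary projection
oblique_idempotent = np.allclose(Q @ Q, Q)
oblique_annihilate = np.allclose(Q @ R, 0.0)

a, b = rng.normal(size=n), rng.normal(size=n)
inner_orthogonal = (P @ a) @ (M @ b)            # zero up to rounding error
inner_oblique = (Q @ a) @ (R @ b)               # generally not zero
print(inner_orthogonal, inner_oblique)
```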


Provided that $X$ has full rank, the subspace $\mathcal{S}(X)$ is $k$-dimensional, and so the
first term in the decomposition $y = P_X y + M_X y$ belongs to a $k$-dimensional
space. Since $y$ itself belongs to $E^n$, which has $n$ dimensions, it follows that
the complementary space $\mathcal{S}^\perp(X)$ must have $n - k$ dimensions. The number
$n - k$ is called the codimension of $X$ in $E^n$.


Geometrically, an orthogonal decomposition $y = P_X y + M_X y$ can be represented
by a right-angled triangle, with $y$ as the hypotenuse and $P_X y$ and
$M_X y$ as the other two sides. In terms of projections, equation (2.18), which
is really just Pythagoras’ Theorem, can be rewritten as

$$\|y\|^2 = \|P_X y\|^2 + \|M_X y\|^2. \tag{2.27}$$

In Exercise 2.10, readers are asked to provide an algebraic proof of this equation.
Since every term in (2.27) is nonnegative, we obtain the useful result
that, for any orthogonal projection matrix $P_X$ and any vector $y \in E^n$,

$$\|P_X y\| \le \|y\|. \tag{2.28}$$

In effect, this just says that the hypotenuse is at least as long as either of the other
sides of a right-angled triangle.
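
Both the dimension count and the inequality (2.28) are easy to check numerically. The sketch below assumes NumPy and simulated data; the explicit tolerance passed to `matrix_rank` is a defensive choice made here so that rounding error is not counted as rank.

```python
import numpy as np

rng = np.random.default_rng(3)                  # arbitrary seed
n, k = 10, 3                                    # hypothetical dimensions
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)           # P_X
M = np.eye(n) - P                               # M_X

dim_image_P = np.linalg.matrix_rank(P, tol=1e-10)   # dimension of S(X)
dim_image_M = np.linalg.matrix_rank(M, tol=1e-10)   # dimension of the orthogonal complement

norm_y = np.linalg.norm(y)
norm_Py = np.linalg.norm(P @ y)
norm_My = np.linalg.norm(M @ y)
print(dim_image_P, dim_image_M, norm_Py <= norm_y)
```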


In general, we will use $P$ and $M$ subscripted by matrix expressions to denote
the matrices that, respectively, project on to and off the subspaces spanned by
the columns of those matrix expressions. Thus $P_Z$ would be the matrix that
projects on to $\mathcal{S}(Z)$, $M_{X,W}$ would be the matrix that projects off $\mathcal{S}(X, W)$, or,
equivalently, on to $\mathcal{S}^\perp(X, W)$, and so on. It is frequently very convenient to
express the quantities that arise in econometrics using these matrices, partly
because the resulting expressions are relatively compact, and partly because
the properties of projection matrices often make it easy to understand what
those expressions mean. However, projection matrices are of little use for
computation because they are of dimension $n \times n$. It is never efficient to
calculate residuals or fitted values by explicitly using projection matrices, and
it can be extremely inefficient if $n$ is large.
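
One way to avoid forming an $n \times n$ matrix is sketched below, assuming NumPy. The thin QR factorization is a technique chosen here for illustration; the text does not prescribe a particular algorithm. When $X$ has full column rank, the $n \times k$ factor `Q` has orthonormal columns spanning $\mathcal{S}(X)$, so that `Q @ Q.T` equals $P_X$ without ever being formed.

```python
import numpy as np

rng = np.random.default_rng(4)                  # arbitrary seed
n, k = 500, 4                                   # hypothetical dimensions
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

# Thin QR factorization: Q is n x k, R is k x k
Q, R = np.linalg.qr(X)
fitted = Q @ (Q.T @ y)                          # P_X y, using only n x k storage
resid = y - fitted                              # M_X y

# For comparison only: the explicit n x n projection matrix
P = X @ np.linalg.solve(X.T @ X, X.T)
same_fitted = np.allclose(P @ y, fitted)
print(same_fitted)
```

The parenthesization `Q @ (Q.T @ y)` matters: it multiplies a $k \times n$ matrix by a vector first, so no $n \times n$ intermediate result is created.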


Linear Transformations of Regressors 


The span $\mathcal{S}(X)$ of the regressors of a linear regression can be defined in many
equivalent ways. All that is needed is a set of $k$ vectors that encompass
all the $k$ directions of the $k$-dimensional subspace. Consider what happens
when we postmultiply $X$ by any nonsingular $k \times k$ matrix $A$. This is called
a nonsingular linear transformation. Let $A$ be partitioned by its columns,
which may be denoted $a_i$, $i = 1, \ldots, k$:

$$XA = X\,[\,a_1 \;\; a_2 \;\; \cdots \;\; a_k\,] = [\,Xa_1 \;\; Xa_2 \;\; \cdots \;\; Xa_k\,].$$

Each block in the product takes the form $Xa_i$, which is an $n$-vector that is
a linear combination of the columns of $X$. Thus any element of $\mathcal{S}(XA)$ must
also be an element of $\mathcal{S}(X)$. But any element of $\mathcal{S}(X)$ is also an element
of $\mathcal{S}(XA)$. To see this, note that any element of $\mathcal{S}(X)$ can be written as $X\beta$
for some $\beta \in \mathbb{R}^k$. Since $A$ is nonsingular, and thus invertible,

$$X\beta = XAA^{-1}\beta = (XA)(A^{-1}\beta).$$

Because $A^{-1}\beta$ is just a $k$-vector, this expression is a linear combination of
the columns of $XA$, that is, an element of $\mathcal{S}(XA)$. Since every element of
$\mathcal{S}(XA)$ belongs to $\mathcal{S}(X)$, and every element of $\mathcal{S}(X)$ belongs to $\mathcal{S}(XA)$, these
two subspaces must be identical.


Given the identity of $\mathcal{S}(X)$ and $\mathcal{S}(XA)$, it seems intuitively compelling to
suppose that the orthogonal projections $P_X$ and $P_{XA}$ should be the same.
This is in fact the case, as can be verified directly:

$$\begin{aligned} P_{XA} &= XA(A^\top X^\top XA)^{-1}A^\top X^\top \\ &= XAA^{-1}(X^\top X)^{-1}(A^\top)^{-1}A^\top X^\top \\ &= X(X^\top X)^{-1}X^\top = P_X. \end{aligned}$$

When expanding the inverse of the matrix $A^\top X^\top XA$, we used the reversal
rule for inverses; see Exercise 1.15.


We have already seen that the vectors of fitted values and residuals depend 
on X only through Px and Mx. Therefore, they too must be invariant to 
any nonsingular linear transformation of the columns of X. Thus if, in the 
regression $y = X\beta + u$, we replace $X$ by $XA$ for some nonsingular matrix $A$,
the residuals and fitted values will not change, even though $\hat\beta$ will change.
We will discuss an example of this important result shortly. 
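
The invariance result can be seen in a short simulation. This is a sketch, assuming NumPy; the particular matrix `A` is an arbitrary nonsingular example chosen here (upper triangular with nonzero diagonal). Because $XA\hat\alpha = X\hat\beta$ and $X$ has full rank, the two sets of estimates are linked by $\hat\beta = A\hat\alpha$.

```python
import numpy as np

rng = np.random.default_rng(5)                  # arbitrary seed
n, k = 40, 3                                    # hypothetical dimensions
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 2.0]])                 # nonsingular: determinant is 2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # regression on X
alpha_hat, *_ = np.linalg.lstsq(X @ A, y, rcond=None)   # regression on XA

fitted_X = X @ beta_hat
fitted_XA = (X @ A) @ alpha_hat
print(np.allclose(fitted_X, fitted_XA))         # fitted values coincide
print(np.allclose(beta_hat, A @ alpha_hat))     # estimates differ, but are linked by A
```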


When the set of regressors contains a constant, it is necessary to express it as 
a vector, just like any other regressor. The coefficient of this vector is then 
the parameter we usually call the constant term. The appropriate vector is $\iota$,
the vector of which each element equals 1. Consider the $n$-vector $\beta_1\iota + \beta_2 x$,
where $x$ is any nonconstant regressor, and $\beta_1$ and $\beta_2$ are scalar parameters.
The $t^{\text{th}}$ element of this vector is $\beta_1 + \beta_2 x_t$. Thus adding the vector $\beta_1\iota$ to
$\beta_2 x$ simply adds the scalar $\beta_1$ to each component of $\beta_2 x$. For any regression
which includes a constant term, then, the fact that we can perform arbitrary 
nonsingular transformations of the regressors without affecting residuals or 
fitted values implies that these vectors are unchanged if we add any constant 
amount to any one or more of the regressors. 


Another implication of the invariance of residuals and fitted values under 
nonsingular transformations of the regressors is that these vectors are un- 
changed if we change the units of measurement of the regressors. Suppose, 
for instance, that the temperature is one of the explanatory variables in a re- 
gression with a constant term. A practical example in which the temperature 
could have good explanatory power is the modeling of electricity demand: 
More electrical power is consumed if the weather is very cold, or, in societies 
where air conditioners are common, very hot. In a few countries, notably the 
United States, temperatures are still measured in Fahrenheit degrees, while 
in most countries they are measured in Celsius (centigrade) degrees. It would 
be disturbing if our conclusions about the effect of temperature on electricity 
demand depended on whether we used the Fahrenheit or the Celsius scale. 


Let the temperature variable, expressed as an $n$-vector, be denoted as $T$ in
Celsius and as $F$ in Fahrenheit, the constant as usual being represented by $\iota$.
Then $F = 32\iota + \tfrac{9}{5}T$, and, if the constant is included in the transformation,

$$[\,\iota \;\; F\,] = [\,\iota \;\; T\,] \begin{bmatrix} 1 & 32 \\ 0 & 9/5 \end{bmatrix}. \tag{2.29}$$

The constant and the two different temperature measures are related by a
linear transformation that is easily seen to be nonsingular, since Fahrenheit
degrees can be converted back into Celsius. This implies that the residuals
and fitted values are unaffected by our choice of temperature scale.


Let us denote the constant term and the slope coefficient as $\beta_1$ and $\beta_2$ if we
use the Celsius scale, and as $\alpha_1$ and $\alpha_2$ if we use the Fahrenheit scale. Then
it is easy to see that these parameters are related by the equations

$$\beta_1 = \alpha_1 + 32\alpha_2 \quad\text{and}\quad \beta_2 = \tfrac{9}{5}\alpha_2. \tag{2.30}$$

To see that this makes sense, suppose that the temperature is at freezing
point, which is 0° Celsius and 32° Fahrenheit. Then the combined effect of
the constant and the temperature on electricity demand is $\beta_1 + 0\beta_2 = \beta_1$
using the Celsius scale, and $\alpha_1 + 32\alpha_2$ using the Fahrenheit scale. These
should be the same, and, according to (2.30), they are. Similarly, the effect of
a 1-degree increase in the Celsius temperature is given by $\beta_2$. Now 1 Celsius
degree equals 9/5 Fahrenheit degrees, and the effect of a temperature increase
of 9/5 Fahrenheit degrees is given by $\tfrac{9}{5}\alpha_2$. We are assured by (2.30) that the
two effects are the same.
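
The temperature example can be reproduced with simulated data. The sketch below assumes NumPy; the demand equation used to generate `y` is purely hypothetical and serves only to produce a dependent variable.

```python
import numpy as np

rng = np.random.default_rng(6)                  # arbitrary seed
n = 60                                          # hypothetical sample size
iota = np.ones(n)
T = rng.uniform(-10.0, 35.0, size=n)            # temperature in Celsius
F = 32.0 * iota + 9.0 / 5.0 * T                 # the same temperatures in Fahrenheit
y = 100.0 + 0.05 * (T - 18.0) ** 2 + rng.normal(size=n)   # hypothetical demand

A = np.array([[1.0, 32.0],
              [0.0, 9.0 / 5.0]])                # transformation matrix in (2.29)
X_c = np.column_stack([iota, T])
X_f = np.column_stack([iota, F])

beta, *_ = np.linalg.lstsq(X_c, y, rcond=None)    # Celsius coefficients
alpha, *_ = np.linalg.lstsq(X_f, y, rcond=None)   # Fahrenheit coefficients

print(np.allclose(X_f, X_c @ A))                          # (2.29)
print(np.allclose(X_c @ beta, X_f @ alpha))               # identical fitted values
print(beta[0] - (alpha[0] + 32.0 * alpha[1]))             # first part of (2.30), about zero
print(beta[1] - 9.0 / 5.0 * alpha[1])                     # second part of (2.30), about zero
```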


2.4 The Frisch-Waugh-Lovell Theorem


In this section, we discuss an extremely useful property of least squares esti- 
mates, which we will refer to as the Frisch-Waugh-Lovell Theorem, or FWL 
Theorem for short. It was introduced to econometricians by Frisch and Waugh 
(1933), and then reintroduced by Lovell (1963). 


Deviations from the Mean 


We begin by considering a particular nonsingular transformation of variables 
in a regression with a constant term. We saw at the end of the last section 
that residuals and fitted values are invariant under such transformations of 
the regressors. For simplicity, consider a model with a constant and just one 
explanatory variable: 

$$y = \beta_1\iota + \beta_2 x + u. \tag{2.31}$$


In general, $x$ is not orthogonal to $\iota$, but there is a very simple transformation
which makes it so. This transformation replaces the observations in $x$ by
deviations from the mean. In order to perform the transformation, one first
calculates the mean of the $n$ observations of the vector $x$,

$$\bar x \equiv \frac{1}{n}\sum_{t=1}^{n} x_t,$$

and then subtracts the constant $\bar x$ from each element of $x$. This yields the
vector of deviations from the mean, $z \equiv x - \bar x\iota$. The vector $z$ is easily seen
to be orthogonal to $\iota$, because

$$\iota^\top z = \iota^\top(x - \bar x\iota) = n\bar x - \bar x\,\iota^\top\iota = n\bar x - n\bar x = 0.$$

The operation of expressing a variable in terms of the deviations from its
mean is called centering the variable. In this case, the vector $z$ is the centered
version of the vector $x$.


Since centering leads to a variable that is orthogonal to $\iota$, it can be performed
algebraically by the orthogonal projection matrix $M_\iota$. This can be verified
by observing that

$$M_\iota x = (I - P_\iota)x = x - \iota(\iota^\top\iota)^{-1}\iota^\top x = x - \bar x\iota = z, \tag{2.32}$$

as claimed. Here, we once again used the facts that $\iota^\top\iota = n$ and $\iota^\top x = n\bar x$.
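
Equation (2.32) can be confirmed directly. This is a minimal sketch, assuming NumPy and a simulated regressor; forming $M_\iota$ explicitly is done only for illustration, since subtracting the mean is of course the practical way to center.

```python
import numpy as np

rng = np.random.default_rng(7)                  # arbitrary seed
n = 12                                          # hypothetical sample size
x = rng.normal(loc=5.0, scale=2.0, size=n)      # a nonconstant regressor
iota = np.ones(n)

P_iota = np.outer(iota, iota) / n               # iota (iota' iota)^{-1} iota', since iota' iota = n
M_iota = np.eye(n) - P_iota
z = M_iota @ x                                  # centered version of x, as in (2.32)

print(np.allclose(z, x - x.mean()))             # same as subtracting the mean
print(iota @ z)                                 # zero up to rounding error
```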


The idea behind the use of deviations from the mean is that it makes sense
to separate the overall level of a dependent variable from its dependence on
explanatory variables. Specifically, if we write (2.31) in terms of $z$, we get

$$y = (\beta_1 + \beta_2\bar x)\iota + \beta_2 z + u = \alpha_1\iota + \alpha_2 z + u,$$

where we see that

$$\alpha_1 = \beta_1 + \beta_2\bar x, \quad\text{and}\quad \alpha_2 = \beta_2.$$

If, for some observation $t$, the value of $x_t$ were exactly equal to the mean
value, $\bar x$, then $z_t = 0$. Thus we find that $y_t = \alpha_1 + u_t$. We interpret this as
saying that the expected value of $y_t$, when the explanatory variable takes on
its average value, is the constant $\alpha_1$.


Figure 2.12 Adding a constant does not affect the slope coefficient


The effect on $y_t$ of a change of one unit in $x_t$ is measured by the slope coefficient
$\beta_2$. If we hold $\bar x$ at its value before $x_t$ is changed, then the unit change
in $x_t$ induces a unit change in $z_t$. Thus a unit change in $z_t$, which is measured
by the slope coefficient $\alpha_2$, should have the same effect as a unit change in $x_t$.
Accordingly, $\alpha_2 = \beta_2$, just as we found above.


The slope coefficients $\alpha_2$ and $\beta_2$ would be the same with any constant in the
place of $\bar x$. The reason for this can be seen geometrically, as illustrated in
Figure 2.12. This figure, which is constructed in the same way as panel b) of
Figure 2.11, depicts the span of $\iota$ and $x$, with $\iota$ in the horizontal direction.
As before, the vector $y$ is not shown, because a third dimension would be
required; the vector would extend from the origin to a point off the plane of
the page and directly above (or below) the point labelled $\hat y$.


The figure shows the vector of fitted values $\hat y$ as the vector sum $\hat\beta_1\iota + \hat\beta_2 x$.
The slope coefficient $\hat\beta_2$ is the ratio of the length of the vector $\hat\beta_2 x$ to that
of $x$; geometrically, it is given by the ratio OA/OB. Then a new regressor $z$
is defined by adding the constant value $c$, which is negative in the figure, to
each component of $x$, giving $z = x + c\iota$. In terms of this new regressor, the
vector $\hat y$ is given by $\hat\alpha_1\iota + \hat\alpha_2 z$, and $\hat\alpha_2$ is given by the ratio OC/OD. Since
the ratios OA/OB and OC/OD are clearly the same, we see that $\hat\alpha_2 = \hat\beta_2$. A
formal argument would use the fact that OAC and OBD are similar triangles.
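
The geometric argument has a simple numerical counterpart. The sketch below assumes NumPy and simulated data; the value of `c` is arbitrary. Substituting $x = z - c\iota$ into $\hat\beta_1\iota + \hat\beta_2 x$ shows that the slope is unchanged while the constant becomes $\hat\beta_1 - c\hat\beta_2$.

```python
import numpy as np

rng = np.random.default_rng(8)                  # arbitrary seed
n = 30                                          # hypothetical sample size
iota = np.ones(n)
x = rng.normal(loc=3.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

c = -1.7                                        # arbitrary constant added to x
z = x + c * iota

beta_hat, *_ = np.linalg.lstsq(np.column_stack([iota, x]), y, rcond=None)
alpha_hat, *_ = np.linalg.lstsq(np.column_stack([iota, z]), y, rcond=None)

print(alpha_hat[1] - beta_hat[1])                        # slopes agree: about zero
print(alpha_hat[0] - (beta_hat[0] - c * beta_hat[1]))    # constants are linked: about zero
```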


Figure 2.13 Orthogonal regressors may be omitted


When the constant $c$ is chosen as $-\bar x$, the vector $z$ is said to be centered, and,
as we saw above, it is orthogonal to $\iota$. In this case, the estimate $\hat\alpha_2$ is the
same whether it is obtained by regressing $y$ on both $\iota$ and $z$, or just on $z$
alone. This is illustrated in Figure 2.13, which shows what Figure 2.12 would
look like when $z$ is orthogonal to $\iota$. Once again, the vector of fitted values $\hat y$
is decomposed as $\hat\alpha_1\iota + \hat\alpha_2 z$, with $z$ now at right angles to $\iota$.


Now suppose that $y$ is regressed on $z$ alone. This means that $y$ is projected
orthogonally on to $\mathcal{S}(z)$, which in the figure is the vertical line through $z$. By
definition,

$$y = \hat\alpha_1\iota + \hat\alpha_2 z + \hat u, \tag{2.33}$$

where $\hat u$ is orthogonal to both $\iota$ and $z$. But $\iota$ is also orthogonal to $z$, and
so the only term on the right-hand side of (2.33) not to be annihilated by
the projection on to $\mathcal{S}(z)$ is the middle term, which is left unchanged by it.
Thus the fitted value vector from regressing $y$ on $z$ alone is just $\hat\alpha_2 z$, and so
the OLS estimate is the same $\hat\alpha_2$ as given by the regression on both $\iota$ and $z$.
Geometrically, we obtain this result because the projection of $y$ on to $\mathcal{S}(z)$ is
the same as the projection of $\hat y$ on to $\mathcal{S}(z)$.


Incidentally, the fact that OLS residuals are orthogonal to all the regressors,
including $\iota$, leads to the important result that the residuals in any regression
with a constant term sum to zero. In fact,

$$\iota^\top\hat u = \sum_{t=1}^{n}\hat u_t = 0;$$

recall (1.29). The residuals will also sum to zero in any regression for which
$\iota \in \mathcal{S}(X)$, even if $\iota$ does not explicitly appear in the list of regressors. This
can happen if the regressors include certain sets of dummy variables, as we
will see in Section 2.5.
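
Both results of this subsection, that a centered regressor gives the same slope with or without the constant and that residuals sum to zero when a constant is included, can be checked as follows. This is a sketch, assuming NumPy and simulated data.

```python
import numpy as np

rng = np.random.default_rng(9)                  # arbitrary seed
n = 30                                          # hypothetical sample size
iota = np.ones(n)
x = rng.normal(loc=3.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
z = x - x.mean()                                # centered regressor, orthogonal to iota

Z = np.column_stack([iota, z])
coef_both, *_ = np.linalg.lstsq(Z, y, rcond=None)   # regression on iota and z
slope_alone = (z @ y) / (z @ z)                     # regression on z alone

resid_both = y - Z @ coef_both
print(coef_both[1] - slope_alone)               # same slope: about zero
print(resid_both.sum())                         # residuals sum to zero up to rounding
```

Because $z$ is orthogonal to $\iota$, the estimated constant in the first regression is simply the sample mean of $y$.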


Two Groups of Regressors


The results proved in the previous subsection are actually special cases of 
more general results that apply to any regression in which the regressors can 
logically be broken up into two groups. Such a regression can be written as 


$$y = X_1\beta_1 + X_2\beta_2 + u, \tag{2.34}$$

where $X_1$ is $n \times k_1$, $X_2$ is $n \times k_2$, and $X$ may be written as the partitioned
matrix $[X_1 \;\; X_2]$, with $k = k_1 + k_2$. In the case dealt with in the previous
subsection, $X_1$ is the constant vector $\iota$ and $X_2$ is either $x$ or $z$. Several other
examples of partitioning $X$ in this way will be considered in Section 2.5.


We begin by assuming that all the regressors in $X_1$ are orthogonal to all the
regressors in $X_2$, so that $X_2^\top X_1 = O$. Under this assumption, the vector of
least squares estimates $\hat\beta_1$ from (2.34) is the same as the one obtained from
the regression

$$y = X_1\beta_1 + u_1, \tag{2.35}$$

and $\hat\beta_2$ from (2.34) is likewise the same as the vector of estimates obtained
from the regression $y = X_2\beta_2 + u_2$. In other words, when $X_1$ and $X_2$ are
orthogonal, we can drop either set of regressors from (2.34) without affecting
the coefficients of the other set.


The vector of fitted values from (2.34) is $P_X y$, while that from (2.35) is $P_1 y$,
where we have used the abbreviated notation

$$P_1 \equiv P_{X_1} = X_1(X_1^\top X_1)^{-1}X_1^\top.$$

As we will show directly,

$$P_1 P_X = P_X P_1 = P_1; \tag{2.36}$$

this is true whether or not $X_1$ and $X_2$ are orthogonal. Thus

$$P_1 y = P_1 P_X y = P_1(X_1\hat\beta_1 + X_2\hat\beta_2) = P_1 X_1\hat\beta_1 = X_1\hat\beta_1. \tag{2.37}$$

The first equality above, which follows from (2.36), says that the projection
of $y$ on to $\mathcal{S}(X_1)$ is the same as the projection of $\hat y \equiv P_X y$ on to $\mathcal{S}(X_1)$.
The second equality follows from the definition of the fitted value vector from
(2.34) as $P_X y$; the third from the orthogonality of $X_1$ and $X_2$, which implies
that $P_1 X_2 = O$; and the last from the fact that $X_1$ is invariant under the
action of $P_1$. Since $P_1 y$ is equal to $X_1$ postmultiplied by the OLS estimates
from (2.35), the equality of the leftmost and rightmost expressions in (2.37)
gives us the result that the same $\hat\beta_1$ can be obtained either from (2.34) or
from (2.35). The analogous result for $\hat\beta_2$ is proved in just the same way.
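
The orthogonal case can be illustrated by constructing two groups of regressors that are orthogonal by design. The sketch below assumes NumPy; `W` is a hypothetical raw matrix from which the part lying in $\mathcal{S}(X_1)$ is removed in order to obtain $X_2$.

```python
import numpy as np

rng = np.random.default_rng(10)                 # arbitrary seed
n, k1, k2 = 40, 2, 3                            # hypothetical dimensions
X1 = rng.normal(size=(n, k1))
W = rng.normal(size=(n, k2))                    # hypothetical raw regressors
X2 = W - X1 @ np.linalg.lstsq(X1, W, rcond=None)[0]   # make X2 orthogonal to X1
y = rng.normal(size=n)

X = np.hstack([X1, X2])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]       # regression (2.34)
b1_only = np.linalg.lstsq(X1, y, rcond=None)[0]     # regression (2.35)
b2_only = np.linalg.lstsq(X2, y, rcond=None)[0]     # regression of y on X2 alone

print(np.allclose(X2.T @ X1, 0.0))              # the two groups are orthogonal
print(np.allclose(b_full[:k1], b1_only))        # same estimates of beta_1
print(np.allclose(b_full[k1:], b2_only))        # same estimates of beta_2
```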


We now drop the assumption that $X_1$ and $X_2$ are orthogonal and prove (2.36),
a very useful result that is true in general. In order to show that $P_X P_1 = P_1$,
we proceed as follows:

$$P_X P_1 = P_X X_1(X_1^\top X_1)^{-1}X_1^\top = X_1(X_1^\top X_1)^{-1}X_1^\top = P_1.$$

The middle equality follows by noting that $P_X X_1 = X_1$, because all the
columns of $X_1$ are in $\mathcal{S}(X)$, and so are left unchanged by $P_X$. The other
equality in (2.36), namely $P_1 P_X = P_1$, is obtained directly by transposing
$P_X P_1 = P_1$ and using the symmetry of $P_X$ and $P_1$. The two results in (2.36)
tell us that the product of two orthogonal projections, where one projects on
to a subspace of the image of the other, is the projection on to that subspace.
See also Exercise 2.14, for the application of this result to the complementary
projections $M_X$ and $M_1$.
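
A quick numerical check of (2.36), with regressors that are deliberately not orthogonal, is sketched below. It assumes NumPy; the helper `proj` is a hypothetical name introduced here. The last line checks the companion identity $M_1 M_X = M_X$, which follows from (2.36) by expanding $(I - P_1)(I - P_X)$.

```python
import numpy as np

def proj(Z):
    """Orthogonal projection on to the span of the columns of Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(11)                 # arbitrary seed
n, k1, k2 = 12, 2, 3                            # hypothetical dimensions
X1 = rng.normal(size=(n, k1))
X2 = rng.normal(size=(n, k2))                   # not orthogonal to X1
X = np.hstack([X1, X2])

P_X, P_1 = proj(X), proj(X1)
M_X, M_1 = np.eye(n) - P_X, np.eye(n) - P_1

print(np.allclose(P_X @ P_1, P_1), np.allclose(P_1 @ P_X, P_1))   # (2.36)
print(np.allclose(M_1 @ M_X, M_X))              # companion result for M_1 and M_X
```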


The general result corresponding to the one shown in Figure 2.12 can be
stated as follows. If we transform the regressor matrix in (2.34) by adding
$X_1A$ to $X_2$, where $A$ is a $k_1 \times k_2$ matrix, and leaving $X_1$ as it is, we have
the regression

$$y = X_1\alpha_1 + (X_2 + X_1A)\alpha_2 + u. \tag{2.38}$$

Then $\hat\alpha_2$ from (2.38) is the same as $\hat\beta_2$ from (2.34). This can be seen immediately
by expressing the right-hand side of (2.38) as a linear combination of
the columns of $X_1$ and of $X_2$.


In the present general context, there is an operation analogous to that of
centering. The result of centering a variable $x$ is a variable $z$ that is orthogonal
to $\iota$, the constant. We can create from $X_2$ a set of variables orthogonal to $X_1$
by acting on $X_2$ with the orthogonal projection $M_1 \equiv I - P_1$, so as to obtain
$M_1X_2$. This allows us to run the regression

$$\begin{aligned} y &= X_1\alpha_1 + M_1X_2\alpha_2 + u \\ &= X_1\alpha_1 + \bigl(X_2 - X_1(X_1^\top X_1)^{-1}X_1^\top X_2\bigr)\alpha_2 + u. \end{aligned}$$

The first line above is a regression model with two groups of regressors, $X_1$
and $M_1X_2$, which are mutually orthogonal. Therefore, $\hat\alpha_2$ will be unchanged
if we omit $X_1$. The second line makes it clear that this regression is a special
case of (2.38), which implies that $\hat\alpha_2$ is equal to $\hat\beta_2$ from (2.34). Consequently,
we see that the two regressions

$$y = X_1\alpha_1 + M_1X_2\beta_2 + u \quad\text{and} \tag{2.39}$$

$$y = M_1X_2\beta_2 + v \tag{2.40}$$

must yield the same estimates of $\beta_2$.


Although regressions (2.34) and (2.40) give the same estimates of $\beta_2$, they
do not give the same residuals, as we have indicated by writing $u$ for one
regression and $v$ for the other. We can see why the residuals are not the same
by looking again at Figure 2.13, in which the constant $\iota$ plays the role of $X_1$,
and the centered variable $z$ plays the role of $M_1X_2$. The point corresponding
to $y$ can be thought of as lying somewhere on a line through the point $\hat y$
and sticking perpendicularly out from the page. The residual vector from
regressing $y$ on both $\iota$ and $z$ is thus represented by the line segment from $\hat y$,
in the page, to $y$, vertically above the page. However, if $y$ is regressed on
$z$ alone, the residual vector is the sum of this line segment and the segment
from $\hat\alpha_2 z$ to $\hat y$, that is, the top side of the rectangle in the figure. If we want
the same residuals in regression (2.34) and a regression like (2.40), we need to
purge the dependent variable of the second segment, which can be seen from
the figure to be equal to $\hat\alpha_1\iota$.


This suggests replacing $y$ by what we get by projecting $y$ off $\iota$. This projection
would be the line segment perpendicular to the page, translated in the
horizontal direction so that it intersected the page at the point $\hat\alpha_2 z$ rather
than $\hat y$. In the general context, the analogous operation replaces $y$ by $M_1 y$,
the projection off $X_1$ rather than off $\iota$. When we perform this projection,
(2.40) is replaced by the regression

$$M_1 y = M_1X_2\beta_2 + \text{residuals}, \tag{2.41}$$

which will yield the same vector of OLS estimates $\hat\beta_2$ as (2.34), and also
the same vector of residuals. This regression is sometimes called the FWL
regression. We used the notation “+ residuals” instead of “+ $u$” in (2.41)
because, in general, the difference between $M_1 y$ and $M_1X_2\beta_2$ is not the same
thing as the vector $u$ in (2.34). If $u$ is interpreted as an error vector, then
(2.41) would not be true if “residuals” were replaced by $u$.

We can now formally state the FWL Theorem. Although the conclusions of 
the theorem have been established gradually in this section, we also provide 
a short formal proof. 


Theorem 2.1. (Frisch-Waugh-Lovell Theorem) 


1. The OLS estimates of $\beta_2$ from regressions (2.34) and (2.41) are
numerically identical. 


2. The residuals from regressions (2.34) and (2.41) are numerically 
identical. 


Proof: By the standard formula (1.46), the estimate of $\beta_2$ from (2.41) is

$$(X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y. \tag{2.42}$$

Let $\hat\beta_1$ and $\hat\beta_2$ denote the two vectors of OLS estimates from (2.34). Then

$$y = P_X y + M_X y = X_1\hat\beta_1 + X_2\hat\beta_2 + M_X y. \tag{2.43}$$

Premultiplying the leftmost and rightmost expressions in (2.43) by $X_2^\top M_1$,
we obtain

$$X_2^\top M_1 y = X_2^\top M_1 X_2\hat\beta_2. \tag{2.44}$$

The first term on the right-hand side of (2.43) has dropped out because $M_1$
annihilates $X_1$. To see that the last term also drops out, observe that

$$M_X M_1 X_2 = M_X X_2 = O. \tag{2.45}$$

The first equality follows from (2.36) (see also Exercise 2.14), and the second
from (2.24), which shows that $M_X$ annihilates all the columns of $X$, in particular
those of $X_2$. Premultiplying $y$ by the transpose of (2.45) shows that
$X_2^\top M_1 M_X y = 0$. We can now solve (2.44) for $\hat\beta_2$ to obtain

$$\hat\beta_2 = (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y,$$

which is expression (2.42). This proves the first part of the theorem.


If we had premultiplied (2.43) by $M_1$ instead of by $X_2^\top M_1$, we would have
obtained

$$M_1 y = M_1 X_2\hat\beta_2 + M_X y, \tag{2.46}$$

where the last term is unchanged from (2.43) because $M_1 M_X = M_X$. The
regressand in (2.46) is the regressand from regression (2.41). Because $\hat\beta_2$ is the
estimate of $\beta_2$ from (2.41), by the first part of the theorem, the first term on
the right-hand side of (2.46) is the vector of fitted values from that regression.
Thus the second term must be the vector of residuals from regression (2.41).
But $M_X y$ is also the vector of residuals from regression (2.34), and this
therefore proves the second part of the theorem.
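
Both parts of the theorem can be verified numerically. The following is a sketch, assuming NumPy and simulated data; the helpers `ols` and `project_off` are hypothetical names introduced here. The projection off $X_1$ is computed as the residuals from regressing on $X_1$, so no $n \times n$ matrix is formed. The last lines also run regression (2.40), which reproduces the estimates but not the residuals.

```python
import numpy as np

def ols(Z, v):
    """OLS coefficients from regressing v (a vector or matrix) on Z."""
    return np.linalg.lstsq(Z, v, rcond=None)[0]

def project_off(Z, v):
    """M_Z v, computed as the residuals from regressing v on Z."""
    return v - Z @ ols(Z, v)

rng = np.random.default_rng(12)                 # arbitrary seed
n, k1, k2 = 50, 2, 2                            # hypothetical dimensions
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, k2)) + 0.5 * X1[:, [1]]        # correlated with X1
y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([2.0, 0.5]) + rng.normal(size=n)

X = np.hstack([X1, X2])
b = ols(X, y)                                   # regression (2.34)
u_full = y - X @ b

M1y, M1X2 = project_off(X1, y), project_off(X1, X2)
b2_fwl = ols(M1X2, M1y)                         # FWL regression (2.41)
u_fwl = M1y - M1X2 @ b2_fwl

b2_240 = ols(M1X2, y)                           # regression (2.40)
v_240 = y - M1X2 @ b2_240

print(np.allclose(b[k1:], b2_fwl), np.allclose(u_full, u_fwl))    # parts 1 and 2
print(np.allclose(b[k1:], b2_240), np.allclose(u_full, v_240))    # same estimates, different residuals
```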


2.5 Applications of the FWL Theorem 


A regression like (2.34), in which the regressors are broken up into two groups, 
can arise in many situations. In this section, we will study three of these. The 
first two, seasonal dummy variables and time trends, are obvious applications 
of the FWL Theorem. The third, measures of goodness of fit that take the 
constant term into account, is somewhat less obvious. In all cases, the FWL 
Theorem allows us to obtain explicit expressions based on (2.42) for subsets 
of the parameter estimates of a linear regression. 


Seasonal Dummy Variables 


For a variety of reasons, it is sometimes desirable to include among the ex- 
planatory variables of a regression model variables that can take on only two 
possible values, which are usually 0 and 1. Such variables are called indicator 
variables, because they indicate a subset of the observations, namely, those 
for which the value of the variable is 1. Indicator variables are a special case 
of dummy variables, which can take on more than two possible values. 


Seasonal variation provides a good reason to employ dummy variables. It 
is common for economic data that are indexed by time to take the form of
quarterly data, where each year in the sample period is represented by four
observations, one for each quarter, or season, of the year. Many economic 
activities are strongly affected by the season, for obvious reasons like Christ- 
mas shopping, or summer holidays, or the difficulty of doing outdoor work 
during very cold weather. This seasonal variation, or seasonality, in economic 
activity is likely to be reflected in the economic time series that are used in 
regression models. The term “time series” is used to refer to any variable the 
observations of which are indexed by the time. Of course, time-series data are 
sometimes annual, in which case there is no seasonal variation to worry about, 
and sometimes monthly, in which case there are twelve “seasons” instead of 
four. For simplicity, we consider only the case of quarterly data. 


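Since there are four seasons, there may be four seasonal dummy variables,
each taking the value 1 for just one of the four seasons. Let us denote these
variables as $s_1$, $s_2$, $s_3$, and $s_4$. If we consider a sample the first observation of
which corresponds to the first quarter of some year, these variables look like

$$s_1 = \begin{bmatrix} 1\\0\\0\\0\\1\\0\\0\\0\\ \vdots \end{bmatrix}, \quad s_2 = \begin{bmatrix} 0\\1\\0\\0\\0\\1\\0\\0\\ \vdots \end{bmatrix}, \quad s_3 = \begin{bmatrix} 0\\0\\1\\0\\0\\0\\1\\0\\ \vdots \end{bmatrix}, \quad s_4 = \begin{bmatrix} 0\\0\\0\\1\\0\\0\\0\\1\\ \vdots \end{bmatrix}. \tag{2.47}$$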

An important property of these variables is that, since every observation must 
correspond to some season, the sum of the seasonal dummies must indicate 
every season. This means that this sum is a vector every component of which 
equals 1. Algebraically, 


$$s_1 + s_2 + s_3 + s_4 = \iota, \tag{2.48}$$


as is clear from (2.47). Since $\iota$ represents the constant in a regression, (2.48)
means that the five-variable set consisting of all four seasonal dummies plus 
the constant is linearly dependent. Consequently, one of the five variables 
must be dropped if all the regressors are to be linearly independent. 
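
The linear dependence implied by (2.48) shows up as a rank deficiency. The sketch below assumes NumPy and a hypothetical sample of three years of quarterly data starting in the first quarter; the tolerance passed to `matrix_rank` is a defensive choice made here.

```python
import numpy as np

n = 12                                          # three years of quarterly data
quarter = np.arange(n) % 4                      # 0, 1, 2, 3, 0, 1, ... (season 1 is coded 0)
S4 = (quarter[:, None] == np.arange(4)).astype(float)   # columns are s1, s2, s3, s4
iota = np.ones(n)

sum_is_iota = np.allclose(S4.sum(axis=1), iota)          # (2.48)

rank_all_five = np.linalg.matrix_rank(np.column_stack([iota, S4]), tol=1e-10)
rank_drop_const = np.linalg.matrix_rank(S4, tol=1e-10)
rank_drop_s1 = np.linalg.matrix_rank(np.column_stack([iota, S4[:, 1:]]), tol=1e-10)
print(sum_is_iota, rank_all_five, rank_drop_const, rank_drop_s1)
```

All five variables together span only four dimensions, and dropping any one of them leaves four linearly independent regressors.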


Just which one of the five is dropped makes no difference to the fitted values
and residuals of a regression, because it is easy to check that

$$\mathcal{S}(s_1, s_2, s_3, s_4) = \mathcal{S}(\iota, s_2, s_3, s_4) = \mathcal{S}(\iota, s_1, s_3, s_4),$$

and so on. However, the parameter estimates associated with the set of four
variables that we choose to keep have different interpretations depending on
that choice. Suppose first that we drop the constant and run the regression

$$y = \alpha_1 s_1 + \alpha_2 s_2 + \alpha_3 s_3 + \alpha_4 s_4 + X\beta + u, \tag{2.49}$$

where the $n \times k$ matrix $X$ contains other explanatory variables. Consider a
single observation, indexed by $t$, that corresponds to the first season. The $t^{\text{th}}$
observations of $s_2$, $s_3$, and $s_4$ are all 0, and that of $s_1$ is 1. Thus, if we write
out the $t^{\text{th}}$ observation of (2.49), we get

$$y_t = \alpha_1 + X_t\beta + u_t.$$

From this it is clear that, for all $t$ belonging to the first season, the constant
term in the regression is $\alpha_1$. If we repeat this exercise for $t$ in the second,
third, or fourth season, we see at once that $\alpha_i$ is the constant for season $i$.
Thus the introduction of the seasonal dummies gives us a different constant
for every season.


An alternative is to retain the constant and drop $s_1$. This yields

$$y = \alpha_0\iota + \gamma_2 s_2 + \gamma_3 s_3 + \gamma_4 s_4 + X\beta + u.$$

It is clear that, in this specification, the overall constant $\alpha_0$ is really the
constant for season 1. For an observation belonging to season 2, the constant
is $\alpha_0 + \gamma_2$, for an observation belonging to season 3, it is $\alpha_0 + \gamma_3$, and so
on. The easiest way to interpret this is to think of season 1 as the reference
season. The coefficients $\gamma_i$, $i = 2, 3, 4$, measure the difference between $\alpha_0$,
the constant for the reference season, and the constant for season $i$. Since we
could have dropped any of the seasonal dummies, the reference season is, of
course, entirely arbitrary.


Another alternative is to retain the constant and use the three dummy vari- 
ables defined by 


$s_1' = s_1 - s_4, \quad s_2' = s_2 - s_4, \quad s_3' = s_3 - s_4.$ (2.50)


These new dummy variables are not actually indicator variables, because their
components for season 4 are equal to $-1$, but they have the advantage that,
for each complete year, the sum of their components for that year is 0. Thus,
for any sample whose size is a multiple of 4, each of the $s_i'$, $i = 1,2,3$, is
orthogonal to the constant. We can write the regression as


$y = \delta_0\iota + \delta_1 s_1' + \delta_2 s_2' + \delta_3 s_3' + X\beta + u.$ (2.51)


It is easy to see that, for $t$ in season $i$, $i = 1,2,3$, the constant term is $\delta_0 + \delta_i$.
For $t$ belonging to season 4, it is $\delta_0 - \delta_1 - \delta_2 - \delta_3$. Thus the average of
the constants for all four seasons is just $\delta_0$, the coefficient of the constant, $\iota$.
Accordingly, the $\delta_i$, $i = 1,2,3$, measure the difference between the average
constant $\delta_0$ and the constant specific to season $i$. Season 4 is a bit of a mess,
because of the arithmetic needed to ensure that the average does indeed work
out to $\delta_0$.
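
As a rough numerical check of the three parametrizations discussed above, the following NumPy sketch fits each of them to simulated quarterly data. The sample size, seed, and coefficient values are arbitrary assumptions made here for illustration, and `ols` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                                   # ten complete years of quarterly data
season = np.arange(n) % 4                # 0, 1, 2, 3 stand for seasons 1 to 4
s = np.eye(4)[season]                    # n x 4 matrix whose columns are s1..s4
iota = np.ones((n, 1))
x = rng.normal(size=(n, 1))              # one other explanatory variable
y = s @ np.array([1.0, 2.0, 0.5, -1.0]) + 0.7 * x[:, 0] + rng.normal(size=n)

def ols(X, y):
    """Return the OLS coefficient vector from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Four seasonal dummies and no constant, as in (2.49)
alpha = ols(np.hstack([s, x]), y)[:4]

# Constant plus s2, s3, s4: season 1 is the reference season
b = ols(np.hstack([iota, s[:, 1:], x]), y)
alpha0, gamma = b[0], b[1:4]

# Constant plus the variables s_i' = s_i - s_4 of (2.50), as in (2.51)
s_prime = s[:, :3] - s[:, [3]]
d = ols(np.hstack([iota, s_prime, x]), y)
delta0, delta = d[0], d[1:4]

print(alpha)                                      # one constant per season
print(alpha0 + np.concatenate([[0.0], gamma]))    # the same four constants
print(delta0 + np.append(delta, -delta.sum()))    # the same four constants again
print(delta0, alpha.mean())                       # delta0 is their average
```

All three regressions span the same space, so they imply the same four seasonal constants; only the interpretation of the individual coefficients differs.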

Let $S$ denote whatever $n \times 4$ matrix we choose to use in order to span the
constant and the four seasonal variables $s_i$. Then any of the regressions we
have considered so far can be written as


$y = S\delta + X\beta + u.$ (2.52)


This regression has two groups of regressors, as required for the application
of the FWL Theorem. That theorem implies that the estimates $\hat\beta$ and the
residuals $\hat u$ can also be obtained by running the FWL regression


$M_S y = M_S X\beta + \text{residuals},$ (2.53)


where, as the notation suggests, $M_S = I - S(S^\top S)^{-1}S^\top$.


The effect of the projection $M_S$ on $y$ and on the explanatory variables in the
matrix $X$ can be considered as a form of seasonal adjustment. By making
$M_S y$ orthogonal to all the seasonal variables, we are, in effect, purging it of its
seasonal variation. Consequently, $M_S y$ can be called a seasonally adjusted,
or deseasonalized, version of $y$, and similarly for the explanatory variables. In
practice, such seasonally adjusted variables can be conveniently obtained as
the residuals from regressing $y$ and each of the columns of $X$ on the variables
in $S$. The FWL Theorem tells us that we get the same results in terms of
estimates of $\beta$ and residuals whether we run (2.52), in which the variables are
unadjusted and seasonality is explicitly accounted for, or run (2.53), in which 
all the variables are seasonally adjusted by regression. This was, in fact, the 
subject of the famous paper by Lovell (1963). 
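
The equivalence of (2.52) and (2.53) can be checked numerically. The sketch below is only an illustration under assumed simulated data (the seed, sample size, and coefficients are arbitrary), and `resid` is a helper introduced here that returns the residuals from a regression on the columns of its first argument.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 48
S = np.eye(4)[np.arange(n) % 4]          # four seasonal dummies; they span the constant
seasonal_means = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.5, 0.5]])
X = rng.normal(size=(n, 2)) + S @ seasonal_means   # regressors with seasonal patterns
y = S @ np.array([1.0, 3.0, 2.0, 0.0]) + X @ np.array([0.5, -1.0]) + rng.normal(size=n)

def resid(Z, v):
    """Residuals from regressing v (a vector or a matrix) on the columns of Z."""
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

# Regression (2.52): unadjusted data, seasonal variables included explicitly
W = np.hstack([S, X])
coef = np.linalg.lstsq(W, y, rcond=None)[0]
beta_full = coef[4:]
u_full = y - W @ coef

# Regression (2.53): every variable seasonally adjusted by regression on S
y_adj, X_adj = resid(S, y), resid(S, X)
beta_fwl = np.linalg.lstsq(X_adj, y_adj, rcond=None)[0]
u_fwl = y_adj - X_adj @ beta_fwl

print(np.allclose(beta_full, beta_fwl), np.allclose(u_full, u_fwl))
```

Both comparisons should hold up to rounding error, which is exactly what the FWL Theorem asserts.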


The equivalence of (2.52) and (2.53) is sometimes used to claim that, in esti- 
mating a regression model with time-series data, it does not matter whether 
one uses “raw” data, along with seasonal dummies, or seasonally adjusted 
data. Such a conclusion is completely unwarranted. Official seasonal adjust- 
ment procedures are almost never based on regression; using official seasonally 
adjusted data is therefore not equivalent to using residuals from regression on 
a set of seasonal variables. Moreover, if (2.52) is not a sensible model (and 
it would not be if, for example, the seasonal pattern were more complicated 
than that given by $S\delta$), then (2.53) is not a sensible specification either.
Seasonality is actually an important practical problem in applied work with 
time-series data. We will discuss it further in Chapter 13. For more detailed 
treatments, see Hylleberg (1986, 1992) and Ghysels and Osborn (2001). 


The deseasonalization performed by the projection $M_S$ makes all variables
orthogonal to the constant as well as to the seasonal dummies. Thus the
effect of $M_S$ is not only to deseasonalize, but also to center, the variables
on which it acts. Sometimes this is undesirable; if so, we may use the three
variables $s_i'$ given in (2.50). Since they are themselves orthogonal to the
constant, no centering takes place if only these three variables are used for 
seasonal adjustment. An explicit constant should normally be included in any 
regression that uses variables seasonally adjusted in this way. 


Time Trends


Another sort of constructed, or artificial, variable that is often encountered 
in models of time-series data is a time trend. The simplest sort of time trend 
is the linear time trend, represented by the vector T, with typical element 
$T_t = t$. Thus $T = [1 \,\vdots\, 2 \,\vdots\, 3 \,\vdots\, 4 \,\vdots\, \cdots]$. Imagine that we have a regression with
a constant and a linear time trend:


$y = \gamma_1\iota + \gamma_2 T + X\beta + u.$


For observation $t$, $y_t$ is equal to $\gamma_1 + \gamma_2 t + X_t\beta + u_t$. Thus the overall level
of $y_t$ increases or decreases steadily as $t$ increases. Instead of just a constant,
we now have the linear (strictly speaking, affine) function of time, $\gamma_1 + \gamma_2 t$.
An increasing time trend might be appropriate, for instance, in a model of a 
production function where technical progress is taking place. An explicit 
model of technical progress might well be difficult to construct, in which 
case a linear time trend could serve as a simple way to take account of the 
phenomenon. 


It is often desirable to make the time trend orthogonal to the constant by 
centering it, that is, operating on it with $M_\iota$. If we do this with a sample
with an odd number of elements, the result is a variable that looks like


$[\cdots \,\vdots\, -3 \,\vdots\, -2 \,\vdots\, -1 \,\vdots\, 0 \,\vdots\, 1 \,\vdots\, 2 \,\vdots\, 3 \,\vdots\, \cdots].$


If the sample size is even, the variable is made up of the half integers $\pm 1/2$,
$\pm 3/2$, $\pm 5/2$, .... In both cases, the coefficient of $\iota$ is the average value of the
linear function of time over the whole sample.
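
A short sketch may make the centered trend concrete. The function name `centered_trend` and the simulated regressand are assumptions introduced here for illustration only.

```python
import numpy as np

def centered_trend(n):
    """Return the linear trend 1, ..., n expressed as deviations from its mean."""
    T = np.arange(1, n + 1, dtype=float)
    return T - T.mean()

print(centered_trend(7))   # odd n: the integers from -3 to 3
print(centered_trend(6))   # even n: the half integers from -2.5 to 2.5

# With a centered trend, the coefficient of the constant is the sample
# average of the fitted linear function of time.
rng = np.random.default_rng(0)
n = 7
T = np.arange(1, n + 1, dtype=float)
y = 2.0 + 0.3 * T + rng.normal(size=n)
iota = np.ones(n)
g = np.linalg.lstsq(np.column_stack([iota, T]), y, rcond=None)[0]
g_c = np.linalg.lstsq(np.column_stack([iota, centered_trend(n)]), y, rcond=None)[0]
print(g_c[0], np.mean(g[0] + g[1] * T))   # these two numbers agree
```

The slope coefficient is unchanged by centering; only the coefficient of the constant is reinterpreted.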


Sometimes it is appropriate to use constructed variables that are more com- 
plicated than a linear time trend. A simple case would be a quadratic time 
trend, with typical element $t^2$. In fact, any deterministic function of the time
index $t$ can be used, including the trigonometric functions $\sin t$ and $\cos t$,
which could be used to account for oscillatory behavior. With such variables, 
it is again usually preferable to make them orthogonal to the constant by 
centering them. 


The FWL Theorem applies just as well with time trends of various sorts as 
it does with seasonal dummy variables. It is possible to project all the other 
variables in a regression model off the time trend variables, thereby obtaining 
detrended variables. The parameter estimates and residuals will be the same as
if the trend variables were explicitly included in the regression. This was in 
fact the type of situation dealt with by Frisch and Waugh (1933). 


Goodness of Fit of a Regression 


In equations (2.18) and (2.19), we showed that the total sum of squares (TSS) 
in the regression model $y = X\beta + u$ can be expressed as the sum of the
explained sum of squares (ESS) and the sum of squared residuals (SSR).
This was really just an application of Pythagoras’ Theorem. In terms of the
orthogonal projection matrices $P_X$ and $M_X$, the relation between TSS, ESS,
and SSR can be written as


$\mathrm{TSS} = \|y\|^2 = \|P_X y\|^2 + \|M_X y\|^2 = \mathrm{ESS} + \mathrm{SSR}.$


This allows us to introduce a measure of goodness of fit for a regression model. 
This measure is formally called the coefficient of determination, but it is 
universally referred to as the $R^2$. The $R^2$ is simply the ratio of ESS to TSS.
It can be written as


$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\|P_X y\|^2}{\|y\|^2} = 1 - \frac{\|M_X y\|^2}{\|y\|^2} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}} = \cos^2\theta,$ (2.54)


where $\theta$ is the angle between $y$ and $P_X y$; see Figure 2.10. For any angle $\theta$,
we know that $-1 \le \cos\theta \le 1$. Consequently, $0 \le R^2 \le 1$. If the angle $\theta$ were
zero, $y$ and $X\hat\beta$ would coincide, the residual vector $\hat u$ would vanish, and we
would have what is called a perfect fit, with $R^2 = 1$. At the other extreme, if
$R^2 = 0$, the fitted value vector would vanish, and $y$ would coincide with the
residual vector $\hat u$.
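
The chain of equalities in (2.54) is easy to verify numerically. The following sketch uses simulated data (the seed, sample size, and coefficients are arbitrary assumptions) and computes the cosine of the angle between $y$ and $P_X y$ directly from their scalar product.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_hat             # P_X y
resid = y - fitted                # M_X y

tss, ess, ssr = y @ y, fitted @ fitted, resid @ resid
cos_theta = (y @ fitted) / (np.linalg.norm(y) * np.linalg.norm(fitted))
r2 = ess / tss

print(np.isclose(tss, ess + ssr))         # Pythagoras' Theorem
print(r2, 1 - ssr / tss, cos_theta ** 2)  # three expressions for the same R^2
```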


As we will see shortly, (2.54) is not the only measure of goodness of fit. It is
known as the uncentered $R^2$, and, to distinguish it from other versions of $R^2$,
it is sometimes denoted as $R_u^2$. Because $R_u^2$ depends on $y$ only through the
residuals and fitted values, it is invariant under nonsingular linear transformations
of the regressors. In addition, because it is defined as a ratio, the value
of $R_u^2$ is invariant to changes in the scale of $y$. For example, we could change
the units in which the regressand is measured from dollars to thousands of
dollars without affecting the value of $R_u^2$.


However, $R_u^2$ is not invariant to changes of units that change the angle $\theta$. An
example of such a change is given by the conversion between the Celsius and
Fahrenheit scales of temperature, where a constant is involved; see (2.29). To
see this, let us consider a very simple change of measuring units, whereby a
constant $\alpha$, analogous to the constant 32 used in converting from Celsius to
Fahrenheit, is added to each element of $y$. In terms of these new units, the
regression of $y$ on a regressor matrix $X$ becomes


$y + \alpha\iota = X\beta + u.$ (2.55)


If we assume that the matrix $X$ includes a constant, it follows that $P_X\iota = \iota$
and $M_X\iota = 0$, and so we find that


$y + \alpha\iota = P_X(y + \alpha\iota) + M_X(y + \alpha\iota) = P_X y + \alpha\iota + M_X y.$

This allows us to compute $R_u^2$ as


$R_u^2 = \frac{\|P_X y + \alpha\iota\|^2}{\|y + \alpha\iota\|^2},$

which is clearly different from (2.54). By choosing $\alpha$ sufficiently large, we can
in fact make $R_u^2$ as close as we wish to 1, because, for very large $\alpha$, the term
$\alpha\iota$ will completely dominate the terms $P_X y$ and $y$ in the numerator and
denominator respectively. But a large $R_u^2$ in such a case would be entirely
misleading, since the “good fit” would be accounted for almost exclusively by 
the constant. 


It is easy to see how to get around this problem, at least for regressions that 
include a constant term. An elementary consequence of the FWL Theorem 
is that we can express all variables as deviations from their means, by the 
operation of the projection $M_\iota$, without changing parameter estimates or
residuals. The ordinary $R^2$ from the regression that uses centered variables is
called the centered $R^2$. It is defined as


$R_c^2 = \frac{\|P_X M_\iota y\|^2}{\|M_\iota y\|^2} = 1 - \frac{\|M_X y\|^2}{\|M_\iota y\|^2},$ (2.56)


and it is clearly unaffected by the addition of a constant to the regressand, as 
in equation (2.55). 
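
The contrast between the two measures shows up clearly in a small experiment. In the sketch below, which relies on simulated data and a helper `r2_both` defined only for this illustration, a constant $\alpha$ is added to the regressand as in (2.55).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X includes a constant
y = X @ np.array([0.0, 0.5]) + rng.normal(size=n)

def r2_both(X, y):
    """Return (uncentered R^2, centered R^2) for an OLS regression of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = resid @ resid
    y_dev = y - y.mean()                                 # M_iota y
    return 1 - ssr / (y @ y), 1 - ssr / (y_dev @ y_dev)

# Add a constant alpha to every element of y and recompute both measures
vals = [r2_both(X, y + alpha) for alpha in (0.0, 10.0, 1000.0)]
for v in vals:
    print(v)
```

The uncentered measure is driven toward 1 as $\alpha$ grows, while the centered measure stays the same for every $\alpha$.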


The centered $R^2$ is much more widely used than the uncentered $R^2$. When $\iota$
is contained in the span $\mathcal{S}(X)$ of the regressors, $R_c^2$ certainly makes far more
sense than $R_u^2$. However, $R_c^2$ does not make sense for regressions without a
constant term or its equivalent in terms of dummy variables. If a statistical
package reports a value for $R^2$ in such a regression, one needs to be very
careful. Different ways of computing $R_c^2$, all of which would yield the same,
correct, answer for regressions that include a constant, may yield quite different
answers for regressions that do not. It is even possible to obtain values of
$R_c^2$ that are less than 0 or greater than 1, depending on how the calculations
are carried out. 
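
A tiny constructed example shows how this can happen. The five data points below are made up purely for illustration; the regression has a single regressor and no constant, and two formulas that would agree in a regression with a constant are applied to it.

```python
import numpy as np

# Hypothetical data: y hovers around 5 and is unrelated to x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 6.0, 5.0, 5.0])

beta = (x @ y) / (x @ x)               # OLS of y on x with no constant term
fitted = beta * x
resid = y - fitted

tss_c = np.sum((y - y.mean()) ** 2)    # centered total sum of squares
r2_from_ssr = 1 - (resid @ resid) / tss_c
r2_from_ess = np.sum((fitted - y.mean()) ** 2) / tss_c

print(r2_from_ssr, r2_from_ess)        # the first is below 0, the second above 1
```

Neither number is a meaningful measure of fit here, which is the point of the warning above.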


Either version of $R^2$ is a valid measure of goodness of fit only when the least
squares estimates $\hat\beta$ are used. If we used some other estimates of $\beta$, say $\tilde\beta$,
the triangle in Figure 2.10 would no longer be a right-angled triangle, and
Pythagoras’ Theorem would no longer apply. As a consequence, (2.54) would
no longer hold, and the different definitions of $R^2$ would no longer be the
same:


$1 - \frac{\|y - X\tilde\beta\|^2}{\|y\|^2} \ne \frac{\|X\tilde\beta\|^2}{\|y\|^2}.$


If we chose to define $R^2$ in terms of the residuals, using the first of these
expressions, we could not guarantee that it would be positive, and if we chose
to define it in terms of the fitted values, using the second, we could not
guarantee that it would be less than 1. Thus, when anything other than
least squares is used to estimate a regression, one should be very cautious
about interpreting a reported $R^2$. It is not a sensible measure of fit in such
a case, and, depending on how it is actually computed, it may be seriously 
misleading. 


Figure 2.14 An influential observation


2.6 Influential Observations and Leverage 


One important feature of OLS estimation, which we have not stressed up to 
this point, is that each element of the vector of parameter estimates $\hat\beta$ is
simply a weighted average of the elements of the vector $y$. To see this, define
$c_i$ as the $i$th row of the matrix $(X^\top X)^{-1}X^\top$ and observe from (2.02) that
$\hat\beta_i = c_i y$. This fact will prove to be of great importance when we discuss the
statistical properties of least squares estimation in the next chapter. 


Because each element of $\hat\beta$ is a weighted average, some observations may
affect the value of $\hat\beta$ much more than others do. Consider Figure 2.14. This
figure is an example of a scatter diagram, a long-established way of graphing
the relation between two variables. Each point in the figure has Cartesian
coordinates $(x_t, y_t)$, where $x_t$ is a typical element of a vector $x$, and $y_t$ of a
vector $y$. One point, drawn with a larger dot than the rest, is indicated, for
reasons to be explained, as a high leverage point. Suppose that we run the
regression

$y = \beta_1\iota + \beta_2 x + u$


twice, once with, and once without, the high leverage observation. For each 
regression, the fitted values all lie on the so-called regression line, which is 
the straight line with equation 


$y = \hat\beta_1 + \hat\beta_2 x.$


The slope of this line is just $\hat\beta_2$, which is why $\beta_2$ is sometimes called the slope
coefficient; see Section 1.1. Similarly, because $\hat\beta_1$ is the intercept that the
regression line makes with the $y$ axis, the constant term $\beta_1$ is sometimes called
the intercept. The regression line is entirely determined by the estimated
coefficients, $\hat\beta_1$ and $\hat\beta_2$.


The regression lines for the two regressions in Figure 2.14 are substantially 
different. The high leverage point is quite distant from the regression line 
obtained when it is excluded. When that point is included, it is able, by 
virtue of its position well to the right of the other observations, to exert a 
good deal of leverage on the regression line, pulling it down toward itself. 
If the y coordinate of this point were greater, making the point closer to 
the regression line excluding it, then it would have a smaller influence on 
the regression line including it. If the x coordinate were smaller, putting 
the point back into the main cloud of points, again there would be a much 
smaller influence. Thus it is the x coordinate that gives the point its position 
of high leverage, but it is the y coordinate that determines whether the high 
leverage position will actually be exploited, resulting in substantial influence 
on the regression line. In a moment, we will generalize these conclusions to 
regressions with any number of regressors. 


If one or a few observations in a regression are highly influential, in the sense
that deleting them from the sample would change some elements of $\hat\beta$
substantially, the prudent econometrician will normally want to scrutinize the
data carefully. It may be that these influential observations are erroneous, or
at least untypical of the rest of the sample. Since a single erroneous observation
can have an enormous effect on $\hat\beta$, it is important to ensure that any
influential observations are not in error. Even if the data are all correct, the 
interpretation of the regression results may change if it is known that a few ob- 
servations are primarily responsible for them, especially if those observations 
differ systematically in some way from the rest of the data. 


Leverage 

The effect of a single observation on $\hat\beta$ can be seen by comparing $\hat\beta$ with $\hat\beta^{(t)}$,
the estimate of $\beta$ that would be obtained if the $t$th observation were omitted
from the sample. Rather than actually omit the $t$th observation, it is easier
to remove its effect by using a dummy variable. The appropriate dummy
variable is $e_t$, an $n$-vector which has $t$th element 1 and all other elements 0.
The vector $e_t$ is called a unit basis vector, unit because its norm is 1, basis
because the set of all the $e_t$, for $t = 1,\ldots,n$, span, or constitute a basis for,
the full space $E^n$; see Exercise 2.20. Considered as an indicator variable, $e_t$
indexes the singleton subsample that contains only observation $t$.


Including $e_t$ as a regressor leads to a regression of the form

$y = X\beta + \alpha e_t + u,$ (2.57)


and, by the FWL Theorem, this gives the same parameter estimates and
residuals as the FWL regression


$M_t y = M_t X\beta + \text{residuals},$ (2.58)

where $M_t \equiv M_{e_t} = I - e_t(e_t^\top e_t)^{-1}e_t^\top$ is the orthogonal projection off the
vector $e_t$. It is easy to see that $M_t y$ is just $y$ with its $t$th component replaced
by 0. Since $e_t^\top e_t = 1$, and since $e_t^\top y$ can easily be seen to be the $t$th component
of $y$,

$M_t y = y - e_t e_t^\top y = y - y_t e_t.$


Thus $y_t$ is subtracted from $y$ for the $t$th observation only. Similarly, $M_t X$
is just $X$ with its $t$th row replaced by zeros. Running regression (2.58) will
give the same parameter estimates as those that would be obtained if we
deleted observation $t$ from the sample. Since the vector $\hat\beta$ is defined exclusively
in terms of scalar products of the variables, replacing the $t$th elements of
these variables by 0 is tantamount to simply leaving observation $t$ out when
computing those scalar products.
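
The claim that adding the dummy variable $e_t$ is equivalent to deleting observation $t$ can be checked directly. The sketch below assumes simulated data with an arbitrary seed and coefficients, and uses a 0-based index `t` as is conventional in Python.

```python
import numpy as np

rng = np.random.default_rng(4)
n, t = 20, 7                         # t is a 0-based observation index here
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

# Regression (2.57): y on X and the unit basis vector e_t
e_t = np.zeros(n)
e_t[t] = 1.0
Z = np.column_stack([X, e_t])
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
beta_dummy = coef[:2]
resid_dummy = y - Z @ coef

# OLS on the sample from which observation t has been deleted
keep = np.arange(n) != t
beta_drop = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]

print(beta_dummy, beta_drop)         # the two estimates of beta coincide
print(resid_dummy[t])                # the residual for observation t is 0
```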


Let us denote by $P_Z$ and $M_Z$, respectively, the orthogonal projections on to
and off $\mathcal{S}(X, e_t)$. The fitted values and residuals from regression (2.57) are
then given by


$y = P_Z y + M_Z y = X\hat\beta^{(t)} + \hat\alpha e_t + M_Z y.$ (2.59)

Now premultiply (2.59) by $P_X$ to obtain

$P_X y = X\hat\beta^{(t)} + \hat\alpha P_X e_t,$ (2.60)


where we have used the fact that $M_Z P_X = O$, because $M_Z$ annihilates both
$X$ and $e_t$. But $P_X y = X\hat\beta$, and so (2.60) gives


$X(\hat\beta^{(t)} - \hat\beta) = -\hat\alpha P_X e_t.$ (2.61)


We can compute the difference between $\hat\beta^{(t)}$ and $\hat\beta$ from this if we can compute
the value of $\hat\alpha$.


In order to calculate $\hat\alpha$, we once again use the FWL Theorem, which tells us
that the estimate of $\alpha$ from (2.57) is the same as the estimate from the FWL
regression

$M_X y = \hat\alpha M_X e_t + \text{residuals}.$


Therefore, using (2.02) and the idempotency of $M_X$,


$\hat\alpha = \frac{e_t^\top M_X y}{e_t^\top M_X e_t}.$ (2.62)

Now $e_t^\top M_X y$ is the $t$th element of $M_X y$, the vector of residuals from the
regression including all observations. We may denote this element as $\hat u_t$. In
like manner, $e_t^\top M_X e_t$, which is just a scalar, is the $t$th diagonal element
of $M_X$. Substituting these into (2.62), we obtain


$\hat\alpha = \frac{\hat u_t}{1 - h_t},$ (2.63)

where $h_t$ denotes the $t$th diagonal element of $P_X$, which is equal to 1 minus
the $t$th diagonal element of $M_X$. The rather odd notation $h_t$ comes from the
fact that $P_X$ is sometimes referred to as the hat matrix, because the vector
of fitted values $X\hat\beta = P_X y$ is sometimes written as $\hat y$, and $P_X$ is therefore
said to “put a hat on” $y$.


Finally, if we premultiply (2.61) by $(X^\top X)^{-1}X^\top$ and use (2.63), we find that


$\hat\beta^{(t)} - \hat\beta = -\hat\alpha (X^\top X)^{-1}X^\top P_X e_t = \frac{-1}{1 - h_t}(X^\top X)^{-1}X_t^\top \hat u_t.$ (2.64)


The second equality uses the facts that $X^\top P_X = X^\top$ and that the final factor
of $e_t$ selects the $t$th column of $X^\top$, which is the transpose of the $t$th row, $X_t$.
Expression (2.64) makes it clear that, when either $\hat u_t$ is large or $h_t$ is large, or
both, the effect of the $t$th observation on at least some elements of $\hat\beta$ is likely
to be substantial. Such an observation is said to be influential.
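
Expression (2.64) means that the effect of deleting an observation can be computed without rerunning the regression. The following sketch, based on simulated data with arbitrary settings chosen here for illustration, compares the formula with a brute-force deletion.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, t = 25, 3, 4                   # t is a 0-based observation index here
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal elements of P_X

# Right-hand side of (2.64): uses only full-sample quantities
change_formula = -XtX_inv @ X[t] * u_hat[t] / (1.0 - h[t])

# Left-hand side of (2.64): actually delete observation t and re-estimate
keep = np.arange(n) != t
beta_t = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
change_direct = beta_t - beta_hat

print(change_formula)
print(change_direct)                 # agrees with the formula up to rounding
```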


From (2.64), it is evident that the influence of an observation depends on both
$\hat u_t$ and $h_t$. It will be greater if the observation has a large residual, which,
as we saw in Figure 2.14, is related to its $y$ coordinate. On the other hand,
$h_t$ is related to the $x$ coordinate of a point, which, as we also saw in the
figure, determines the leverage, or potential influence, of the corresponding
observation. We say that observations for which $h_t$ is large have high leverage
or are leverage points. A leverage point is not necessarily influential, but it 
has the potential to be influential. 


The Diagonal Elements of the Hat Matrix 


Since the leverage of the $t$th observation depends on $h_t$, the $t$th diagonal
element of the hat matrix, it is worth studying the properties of these diagonal
elements in a little more detail. We can express $h_t$ as


$h_t = e_t^\top P_X e_t = \|P_X e_t\|^2.$ (2.65)


Since the rightmost expression here is a square, $h_t \ge 0$. Moreover, since
$\|e_t\| = 1$, we obtain from (2.28) applied to $e_t$ that $h_t = \|P_X e_t\|^2 \le 1$. Thus


$0 \le h_t \le 1.$ (2.66)


The geometrical reason for these bounds on the value of $h_t$ can be found in
Exercise 2.26.


The lower bound in (2.66) can be strengthened when there is a constant term.
In that case, none of the $h_t$ can be less than $1/n$. This follows from (2.65),
because if $X$ consisted only of a constant vector $\iota$, $e_t^\top P_\iota e_t$ would equal $1/n$.
If other regressors are present, then we have


$1/n = \|P_\iota e_t\|^2 = \|P_\iota P_X e_t\|^2 \le \|P_X e_t\|^2 = h_t.$

Here we have used the fact that $P_\iota P_X = P_\iota$ since $\iota$ is in $\mathcal{S}(X)$ by assumption,
and, for the inequality, we have used (2.28). Although $h_t$ cannot be 0 in normal
circumstances, there is a special case in which it equals 1. If one column of
$X$ is the dummy variable $e_t$, $h_t = e_t^\top P_X e_t = e_t^\top e_t = 1$.
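
These bounds can be illustrated with a few lines of code. In the sketch below, `hat_diag` is a helper introduced here; it obtains the $h_t$ from a QR decomposition of $X$, which is one convenient way (assuming $X$ has full column rank) to avoid forming the $n \times n$ matrix $P_X$. The data are simulated with arbitrary settings.

```python
import numpy as np

def hat_diag(X):
    """Diagonal elements h_t of the projection on to the span of the columns of X."""
    Q, _ = np.linalg.qr(X)           # columns of Q: an orthonormal basis for S(X)
    return np.sum(Q ** 2, axis=1)    # diagonal of Q Q', which equals P_X

rng = np.random.default_rng(6)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
h = hat_diag(X)
print(h.min(), 1 / n, h.max())       # every h_t lies between 1/n and 1

# Adding the dummy variable for the first observation forces its h_t to 1
e_t = np.zeros(n)
e_t[0] = 1.0
h_dummy = hat_diag(np.column_stack([X, e_t]))
print(h_dummy[0])                    # equal to 1 up to rounding error
```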


In a regression with $n$ observations and $k$ regressors, the average of the $h_t$ is
equal to $k/n$. In order to demonstrate this, we need to use some properties
of the trace of a square matrix. If $A$ is an $n \times n$ matrix, its trace, denoted
$\mathrm{Tr}(A)$, is the sum of the elements on its principal diagonal. Thus


$\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii}.$


A convenient property is that the trace of a product of two not necessarily
square matrices $A$ and $B$ is unaffected by the order in which the two matrices
are multiplied together. If the dimensions of $A$ are $n \times m$, then, in order for
the product $AB$ to be square, those of $B$ must be $m \times n$. This implies further
that the product $BA$ exists and is $m \times m$. We have


$\mathrm{Tr}(AB) = \sum_{i=1}^{n} (AB)_{ii} = \sum_{i=1}^{n}\sum_{j=1}^{m} A_{ij}B_{ji} = \sum_{j=1}^{m} (BA)_{jj} = \mathrm{Tr}(BA).$ (2.67)


The result (2.67) can be extended. If we consider a (square) product of several 
matrices, the trace is invariant under what is called a cyclic permutation of 
the factors. Thus, as can be seen by successive applications of (2.67), 


Tr(ABC) = Tr(CAB) = Tr(BCA). (2.68) 
We now return to the $h_t$. Their sum is


$\sum_{t=1}^{n} h_t = \mathrm{Tr}(P_X) = \mathrm{Tr}\bigl(X(X^\top X)^{-1}X^\top\bigr)$
$\qquad = \mathrm{Tr}\bigl((X^\top X)^{-1}X^\top X\bigr) = \mathrm{Tr}(I_k) = k.$ (2.69)


The first equality in the second line makes use of (2.68). Then, because we
are multiplying a $k \times k$ matrix by its inverse, we get a $k \times k$ identity matrix,
the trace of which is obviously just $k$. It follows from (2.69) that the average
of the $h_t$ equals $k/n$. When, for a given regressor matrix $X$, the diagonal
elements of $P_X$ are all close to their average value, no observation has very
much leverage. Such an $X$ matrix is sometimes said to have a balanced design.
On the other hand, if some of the $h_t$ are much larger than $k/n$, and others
consequently smaller, the X matrix is said to have an unbalanced design. 
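
Both the trace property (2.67) and the result that the $h_t$ sum to $k$ are easy to confirm numerically. The matrices in this sketch are random draws with arbitrary dimensions, chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Tr(AB) = Tr(BA), even though AB is 5 x 5 and BA is 3 x 3
A = rng.normal(size=(5, 3))
B = rng.normal(size=(3, 5))
tr_ab, tr_ba = np.trace(A @ B), np.trace(B @ A)
print(tr_ab, tr_ba)

# The diagonal elements of P_X sum to k, so their average is k/n
n, k = 100, 4
X = rng.normal(size=(n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(P)
print(h.sum(), h.mean(), k / n)
```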

Figure 2.15 $h_t$ as a function of $X_t$


The $h_t$ tend to be larger for values of the regressors that are farther away
from their average over the sample. As an example, Figure 2.15 plots them
as a function of $X_t$ for a particular sample of 100 observations for the model


$y_t = \beta_1 + \beta_2 X_t + u_t.$


The elements $X_t$ of the regressor are perfectly well behaved, being drawings
from the standard normal distribution. Although the average value of the $h_t$
is $2/100 = 0.02$, $h_t$ varies from 0.0100 for values of $X_t$ near the sample mean to
0.0695 for the largest value of $X_t$, which is about 2.4 standard deviations above
the sample mean. Thus, even in this very typical case, some observations have
a great deal more leverage than others. Those observations with the greatest
amount of leverage are those for which $x_t$ is farthest from the sample mean,
in accordance with the intuition of Figure 2.14. 
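
The pattern in Figure 2.15 can be reproduced approximately with simulated data. The sketch below draws its own sample, so the numerical values will differ from those quoted above; it also uses the standard closed form for $h_t$ in a regression on a constant and one regressor, which is not derived in the text.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.standard_normal(n)                     # drawings from N(0,1)
X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # diagonal of the hat matrix

# Closed form for a constant and one regressor:
# h_t = 1/n + (x_t - xbar)^2 / sum_s (x_s - xbar)^2
dev = x - x.mean()
h_closed = 1 / n + dev ** 2 / (dev @ dev)

print(h.mean())                                # 2/100 = 0.02
print(h.min(), h.max())                        # smallest near xbar, largest far from it
print(np.argmax(h) == np.argmax(np.abs(dev)))  # True
```

Plotting `h` against `x` gives a parabola with its minimum, close to $1/n$, at the sample mean of the regressor.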


2.7 Final Remarks 


In this chapter, we have discussed the numerical properties of OLS estimation 
of linear regression models from a geometrical point of view. This perspective 
often provides a much simpler way to understand such models than does a 
purely algebraic approach. For example, the fact that certain matrices are 
idempotent becomes quite clear as soon as one understands the notion of 
an orthogonal projection. Most of the results discussed in this chapter are 
thoroughly fundamental, and many of them will be used again and again 
throughout the book. In particular, the FWL Theorem will turn out to be 
extremely useful in many contexts. 


The use of geometry as an aid to the understanding of linear regression has 
a long history; see Herr (1980). One valuable reference on linear models that 
takes the geometric approach is Seber (1980). A good expository paper that
is reasonably accessible is Bryant (1984), and a detailed treatment is provided 
by Ruud (2000). 


It is strongly recommended that readers attempt the exercises which follow 
this chapter before starting Chapter 3, in which we turn our attention to the 
statistical properties of OLS estimation. Many of the results of this chapter 
will be useful in establishing these properties, and the exercises are designed 
to enhance understanding of these results. 


2.8 Exercises 


2.1 Consider two vectors $x$ and $y$ in $E^2$. Let $x = [x_1 \,\vdots\, x_2]$ and $y = [y_1 \,\vdots\, y_2]$. Show
trigonometrically that $x^\top y \equiv x_1 y_1 + x_2 y_2$ is equal to $\|x\|\,\|y\|\cos\theta$, where $\theta$
is the angle between $x$ and $y$.


2.2 A vector in $E^n$ can be normalized by multiplying it by the reciprocal of its
norm. Show that, for any $x \in E^n$ with $x \ne 0$, the norm of $x/\|x\|$ is 1.


Now consider two vectors $x, y \in E^n$. Compute the norm of the sum and of
the difference of $x$ normalized and $y$ normalized, that is, of


$\frac{x}{\|x\|} + \frac{y}{\|y\|}$ and $\frac{x}{\|x\|} - \frac{y}{\|y\|}.$


By using the fact that the norm of any nonzero vector is positive, prove the
Cauchy-Schwarz inequality (2.08):


$|x^\top y| \le \|x\|\,\|y\|.$ (2.08)


Show that this inequality becomes an equality when $x$ and $y$ are parallel.
Hint: Show first that $x$ and $y$ are parallel if and only if $x/\|x\| = \pm\, y/\|y\|$.


2.3 The triangle inequality states that, for $x, y \in E^n$,

$\|x + y\| \le \|x\| + \|y\|.$ (2.70)


Draw a 2-dimensional picture to illustrate this result. Prove the result
algebraically by computing the squares of both sides of the above inequality, and
then using (2.08). In what circumstances will (2.70) hold with equality?


2.4 Suppose that $x = [1.0 \,\vdots\, 1.5 \,\vdots\, 1.2 \,\vdots\, 0.7]$ and $y = [3.2 \,\vdots\, 4.4 \,\vdots\, 2.5 \,\vdots\, 2.0]$. What are
$\|x\|$, $\|y\|$, and $x^\top y$? Use these quantities to calculate $\theta$, the angle between
$x$ and $y$, and $\cos\theta$.


2.5 Show explicitly that the left-hand sides of (2.11) and (2.12) are the same.
This can be done either by comparing typical elements or by using the results
in Section 2.3 on partitioned matrices.


2.6 Prove that, if the $k$ columns of $X$ are linearly independent, each vector $z$ in
$\mathcal{S}(X)$ can be expressed as $Xb$ for one and only one $k$-vector $b$. Hint: Suppose
that there are two different vectors, $b_1$ and $b_2$, such that $z = Xb_i$, $i = 1,2$,
and show that this implies that the columns of $X$ are linearly dependent.


Copyright © 1999, Russell Davidson and James G. MacKinnon 


2.7 Consider the vectors $x_1 = [1 \,\vdots\, 2 \,\vdots\, 4]$, $x_2 = [2 \,\vdots\, 3 \,\vdots\, 5]$, and $x_3 = [3 \,\vdots\, 6 \,\vdots\, 12]$.
What is the dimension of the subspace that these vectors span?


2.8 Consider the example of the three vectors $x_1$, $x_2$, and $x_3$ defined in (2.16).
Show that any vector $z \equiv b_1 x_1 + b_2 x_2$ in $\mathcal{S}(x_1, x_2)$ also belongs to $\mathcal{S}(x_1, x_3)$
and $\mathcal{S}(x_2, x_3)$. Give explicit formulas for $z$ as a linear combination of $x_1$
and $x_3$, and of $x_2$ and $x_3$.


2.9 Prove algebraically that $P_X M_X = O$. This is equation (2.26). Use only the
requirement (2.25) that $P_X$ and $M_X$ be complementary projections, and the
idempotency of $P_X$.


2.10 Prove algebraically that equation (2.27), which is really Pythagoras’ Theorem
for linear regression, holds. Use the facts that $P_X$ and $M_X$ are symmetric,
idempotent, and orthogonal to each other.


2.11 Show algebraically that, if $P_X$ and $M_X$ are complementary orthogonal
projections, then $M_X$ annihilates all vectors in $\mathcal{S}(X)$, and $P_X$ annihilates all
vectors in $\mathcal{S}^\perp(X)$.


2.12 Consider the two regressions


$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, and
$y = \alpha_1 z_1 + \alpha_2 z_2 + \alpha_3 z_3 + u$,


where $z_1 = x_1 - 2x_2$, $z_2 = x_2 + 4x_3$, and $z_3 = 2x_1 - 3x_2 + 5x_3$. Let
$X = [x_1 \;\; x_2 \;\; x_3]$ and $Z = [z_1 \;\; z_2 \;\; z_3]$. Show that the columns of $Z$ can be
expressed as linear combinations of the columns of $X$, that is, that $Z = XA$,
for some $3 \times 3$ matrix $A$. Find the elements of this matrix $A$.


Show that the matrix $A$ is invertible, by showing that the columns of $X$ are
linear combinations of the columns of $Z$. Give the elements of $A^{-1}$. Show
that the two regressions give the same fitted values and residuals.


Precisely how is the OLS estimate $\hat\beta_1$ related to the OLS estimates $\hat\alpha_i$, for
$i = 1,\ldots,3$? Precisely how is $\hat\alpha_1$ related to the $\hat\beta_i$, for $i = 1,\ldots,3$?


2.13 Let $X$ be an $n \times k$ matrix of full rank. Consider the $n \times k$ matrix $XA$, where
$A$ is a singular $k \times k$ matrix. Show that the columns of $XA$ are linearly
dependent, and that $\mathcal{S}(XA) \subset \mathcal{S}(X)$.


2.14 Use the result (2.36) to show that $M_X M_1 = M_1 M_X = M_X$, where $X =
[X_1 \;\; X_2]$.


2.15 Consider the following linear regression:

$y = X_1\beta_1 + X_2\beta_2 + u,$


where $y$ is $n \times 1$, $X_1$ is $n \times k_1$, and $X_2$ is $n \times k_2$. Let $\hat\beta_1$ and $\hat\beta_2$ be the OLS
parameter estimates from running this regression.


Now consider the following regressions, all to be estimated by OLS:


a) $y = X_2\beta_2 + u$;
b) $P_1 y = X_2\beta_2 + u$;
c) $P_1 y = P_1 X_2\beta_2 + u$;
d) $P_X y = X_1\beta_1 + X_2\beta_2 + u$;
e) $P_X y = X_2\beta_2 + u$;
f) $M_1 y = X_2\beta_2 + u$;
g) $M_1 y = M_1 X_2\beta_2 + u$;
h) $M_1 y = X_1\beta_1 + M_1 X_2\beta_2 + u$;
i) $M_1 y = M_1 X_1\beta_1 + M_1 X_2\beta_2 + u$;
j) $P_X y = M_1 X_2\beta_2 + u$.


Here $P_1$ projects orthogonally on to the span of $X_1$, and $M_1 = I - P_1$. For
which of the above regressions will the estimates of $\beta_2$ be the same as for the
original regression? Why? For which will the residuals be the same? Why?


2.16 Consider the linear regression

$y = \beta_1\iota + X_2\beta_2 + u,$


where $\iota$ is an $n$-vector of 1s, and $X_2$ is an $n \times (k - 1)$ matrix of observations
on the remaining regressors. Show, using the FWL Theorem, that the OLS
estimators of $\beta_1$ and $\beta_2$ can be written as


$\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} n & \iota^\top X_2 \\ 0 & X_2^\top M_\iota X_2 \end{bmatrix}^{-1} \begin{bmatrix} \iota^\top y \\ X_2^\top M_\iota y \end{bmatrix},$


where, as usual, $M_\iota$ is the matrix that takes deviations from the sample mean.


2.17 Show, preferably using (2.36), that $P_X - P_1$ is an orthogonal projection
matrix. That is, show that $P_X - P_1$ is symmetric and idempotent. Show
further that


$P_X - P_1 = P_{M_1 X_2},$


where $P_{M_1 X_2}$ is the projection on to the span of $M_1 X_2$. This can be done
most easily by showing that any vector in $\mathcal{S}(M_1 X_2)$ is invariant under the
action of $P_X - P_1$, and that any vector orthogonal to this span is annihilated
by $P_X - P_1$.


2.18 Let $\iota$ be a vector of 1s, and let $X$ be an $n \times 3$ matrix, with full rank, of which
the first column is $\iota$. What can you say about the matrix $M_\iota X$? What can
you say about the matrix $P_\iota X$? What is $M_\iota M_X$ equal to? What is $P_\iota M_X$
equal to?


2.19 Express the four seasonal variables, $s_i$, $i = 1,2,3,4$, defined in (2.47), as
functions of the constant $\iota$ and the three variables $s_i'$, $i = 1,2,3$, defined
in (2.50).


2.20 Show that the full $n$-dimensional space $E^n$ is the span of the set of unit basis
vectors $e_t$, $t = 1,\ldots,n$, where all the components of $e_t$ are zero except for
the $t$th, which is equal to 1.


2.21 The file tbrate.data contains data for 1950:1 to 1996:4 for three series: $r_t$,
the interest rate on 90-day treasury bills, $\pi_t$, the rate of inflation, and $y_t$, the
logarithm of real GDP. For the period 1950:4 to 1996:4, run the regression

$\Delta r_t = \beta_1 + \beta_2\pi_{t-1} + \beta_3\Delta y_{t-1} + \beta_4\Delta r_{t-1} + \beta_5\Delta r_{t-2} + u_t,$ (2.71)

where $\Delta$ is the first-difference operator, defined so that $\Delta r_t = r_t - r_{t-1}$. Plot
the residuals and fitted values against time. Then regress the residuals on
the fitted values and on a constant. What do you learn from this second
regression? Now regress the fitted values on the residuals and on a constant.
What do you learn from this third regression?


For the same sample period, regress $\Delta r_t$ on a constant,
$\Delta y_{t-1}$, $\Delta r_{t-1}$, and $\Delta r_{t-2}$. Save the residuals
from this regression, and call them $\hat e_t$. Then regress $\pi_{t-1}$ on a
constant, $\Delta y_{t-1}$, $\Delta r_{t-1}$, and $\Delta r_{t-2}$. Save the
residuals from this regression, and call them $\hat v_t$. Now regress
$\hat e_t$ on $\hat v_t$. How are the estimated coefficient and the residuals
from this last regression related to anything that you obtained when you
estimated regression (2.71)?


Calculate the diagonal elements of the hat matrix for regression (2.71) and 
use them to calculate a measure of leverage. Plot this measure against time. 
On the basis of this plot, which observations seem to have unusually high 
leverage? 


Show that the $t^{\text{th}}$ residual from running regression (2.57) is 0.
Use this fact to demonstrate that, as a result of omitting observation $t$,
the $t^{\text{th}}$ residual from the regression $y = X\beta + u$ changes by
an amount

$$\hat u_t\,\frac{h_t}{1 - h_t}.$$


Calculate a vector of "omit 1" residuals $\hat u^{(\cdot)}$ for regression
(2.71). The $t^{\text{th}}$ element of $\hat u^{(\cdot)}$ is the residual for
the $t^{\text{th}}$ observation calculated from a regression that uses data
for every observation except the $t^{\text{th}}$. Try to avoid running 185
regressions in order to do this! Regress $\hat u^{(\cdot)}$ on the ordinary
residuals $\hat u$. Is the estimated coefficient roughly the size you
expected it to be? Would it be larger or smaller if you were to omit some of
the high-leverage observations?
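
The shortcut hinted at in this exercise follows from the preceding one: adding the change $\hat u_t h_t/(1-h_t)$ to the ordinary residual gives $\hat u_t/(1-h_t)$. The following Python sketch checks that identity against brute-force re-estimation. It uses simulated data rather than tbrate.data, and the sample size, coefficients, seed, and variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
# Simulated regressors (a constant plus two normal columns) and regressand.
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)

# Ordinary residuals and the diagonal of the hat matrix X (X'X)^{-1} X'.
XtX_inv = np.linalg.inv(X.T @ X)
u_hat = y - X @ (XtX_inv @ X.T @ y)
h = np.einsum("ti,ij,tj->t", X, XtX_inv, X)

# Shortcut: the omit-one residual for observation t is u_hat_t / (1 - h_t).
u_omit_fast = u_hat / (1.0 - h)

# Brute force: drop observation t, re-estimate, and predict observation t.
u_omit_slow = np.empty(n)
for t in range(n):
    keep = np.arange(n) != t
    beta_t, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    u_omit_slow[t] = y[t] - X[t] @ beta_t

print(np.max(np.abs(u_omit_fast - u_omit_slow)))  # should be near machine precision
```

The single-regression shortcut needs only the ordinary residuals and the $h_t$, which is why there is no need to run one regression per observation.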


Show that the leverage measure $h_t$ is the square of the cosine of the angle
between the unit basis vector $e_t$ and its projection on to the span
$\mathcal{S}(X)$ of the regressors.


Suppose the matrix $X$ is $150 \times 5$ and has full rank. Let $P_X$ be the
matrix that projects on to $\mathcal{S}(X)$ and let $M_X = I - P_X$. What is
$\mathrm{Tr}(P_X)$? What is $\mathrm{Tr}(M_X)$? What would these be if $X$
did not have full rank but instead had rank 3?


Generate a figure like Figure 2.15 for yourself. Begin by drawing 100
observations of a regressor $x_t$ from the N(0,1) distribution. Then compute
and save the $h_t$ for a regression of any regressand on a constant and
$x_t$. Plot the points $(x_t, h_t)$, and you should obtain a graph similar to
the one in Figure 2.15.


Now add one more observation, $x_{101}$. Start with $x_{101} = \bar x$, the
average value of the $x_t$, and then increase $x_{101}$ progressively until
$x_{101} = \bar x + 20$. For each value of $x_{101}$, compute the leverage
measure $h_{101}$. How does $h_{101}$ change as $x_{101}$ gets larger? Why is
this in accord with the result that $h_t = 1$ if the regressors include the
dummy variable $e_t$?


The Statistical Properties of Ordinary Least Squares


3.1 Introduction 


In the previous chapter, we studied the numerical properties of ordinary least 
squares estimation, properties that hold no matter how the data may have 
been generated. In this chapter, we turn our attention to the statistical prop- 
erties of OLS, ones that depend on how the data were actually generated. 
These properties can never be shown to hold numerically for any actual data 
set, but they can be proven to hold if we are willing to make certain as- 
sumptions. Most of the properties that we will focus on concern the first two 
moments of the least squares estimator. 


In Section 1.5, we introduced the concept of a data-generating process, or 
DGP. For any data set that we are trying to analyze, the DGP is simply 
the mechanism that actually generated the data. Most real DGPs for econ- 
omic data are probably very complicated, and economists do not pretend to 
understand every detail of them. However, for the purpose of studying the sta- 
tistical properties of estimators, it is almost always necessary to assume that 
the DGP is quite simple. For instance, when we are studying the (multiple) 
linear regression model 


$$y_t = X_t\beta + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2), \tag{3.01}$$

we may wish to assume that the data were actually generated by the DGP

$$y_t = X_t\beta_0 + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma_0^2). \tag{3.02}$$


The symbol "$\sim$" in (3.01) and (3.02) means "is distributed as." We
introduced the abbreviation IID, which means "independently and identically
distributed," in Section 1.3. In the model (3.01), the notation
$\mathrm{IID}(0, \sigma^2)$ means that the $u_t$ are statistically
independent and all follow the same distribution, with mean 0 and variance
$\sigma^2$. Similarly, in the DGP (3.02), the notation
$\mathrm{NID}(0, \sigma_0^2)$ means that the $u_t$ are normally,
independently, and identically distributed, with mean 0 and variance
$\sigma_0^2$. In both cases, it is implicitly being assumed that the
distribution of $u_t$ is in no way dependent on $X_t$.


The differences between the regression model (3.01) and the DGP (3.02) may
seem subtle, but they are important. A key feature of a DGP is that it
constitutes a complete specification, where that expression means, as in
Section 1.3, that enough information is provided for the DGP to be simulated
on a computer. For that reason, in (3.02) we must provide specific values for
the parameters $\beta$ and $\sigma^2$ (the zero subscripts on these
parameters are intended to remind us of this), and we must specify from what
distribution the error terms are to be drawn (here, the normal distribution).
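
To make the idea of a complete specification concrete, here is a minimal Python sketch of simulating the DGP (3.02). The regressor matrix, the values chosen for $\beta_0$ and $\sigma_0$, the seed, and the function name `simulate_dgp` are hypothetical choices for illustration; nothing in the text fixes them.

```python
import numpy as np

def simulate_dgp(X, beta0, sigma0, rng):
    """Draw y from y_t = X_t beta0 + u_t with u_t ~ NID(0, sigma0^2)."""
    u = rng.normal(loc=0.0, scale=sigma0, size=X.shape[0])
    return X @ beta0 + u

rng = np.random.default_rng(42)
n = 200
# Illustrative regressors: a constant and one uniformly distributed variable.
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
beta0 = np.array([1.0, 0.5])   # hypothetical parameter values
sigma0 = 2.0                   # hypothetical error standard deviation
y = simulate_dgp(X, beta0, sigma0, rng)
```

Removing any one of the ingredients (the parameter values or the error distribution) would make the simulation impossible, which is the sense in which the model (3.01) alone is not a complete specification.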


A model is defined as a set of data-generating processes. Since a model is a
set, we will sometimes use the notation $\mathbb{M}$ to denote it. In the
case of the linear regression model (3.01), this set consists of all DGPs of
the form (3.01) in which the coefficient vector $\beta$ takes some value in
$\mathbb{R}^k$, the variance $\sigma^2$ is some positive real number, and the
distribution of $u_t$ varies over all possible distributions that have mean 0
and variance $\sigma^2$. Although the DGP (3.02) evidently belongs to this
set, it is considerably more restrictive.


The set of DGPs of the form (3.02) defines what is called the classical normal 
linear model, where the name indicates that the error terms are normally 
distributed. The model (3.01) is larger than the classical normal linear model, 
because, although the former specifies the first two moments of the error 
terms, and requires the error terms to be mutually independent, it says no 
more about them, and in particular it does not require them to be normal. 
All of the results we prove in this chapter, and many of those in the next, 
apply to the linear regression model (3.01), with no normality assumption. 
However, in order to obtain some of the results in the next two chapters, it 
will be necessary to limit attention to the classical normal linear model. 


For most of this chapter, we assume that whatever model we are studying, 
the linear regression model or the classical normal linear model, is correctly 
specified. By this, we mean that the DGP that actually generated our data 
belongs to the model under study. A model is misspecified if that is not the 
case. It is crucially important, when studying the properties of an estimation 
procedure, to distinguish between properties which hold only when the model 
is correctly specified, and properties, like those treated in the previous chapter, 
which hold no matter what the DGP. We can talk about statistical properties 
only if we specify the DGP. 


In the remainder of this chapter, we study a number of the most important
statistical properties of ordinary least squares estimation, by which we mean
least squares estimation of linear regression models. In the next section, we
discuss the concept of bias and prove that, under certain conditions,
$\hat\beta$, the OLS estimator of $\beta$, is unbiased. Then, in Section 3.3,
we discuss the concept of consistency and prove that, under considerably
weaker conditions, $\hat\beta$ is consistent. In Section 3.4, we turn our
attention to the covariance matrix of $\hat\beta$, and we discuss the concept
of collinearity. This leads naturally to a discussion of the efficiency of
least squares estimation in Section 3.5, in which we prove the famous
Gauss-Markov Theorem. In Section 3.6, we discuss the estimation of
$\sigma^2$ and the relationship between error terms and least squares
residuals. Up to this point, we will assume that the DGP belongs to the
model being estimated. In Section 3.7, we relax this assumption and consider
the consequences of estimating a model that is misspecified in certain ways.
Finally, in Section 3.8, we discuss the adjusted $R^2$ and other ways of
measuring how well a regression fits.


Copyright © 1999, Russell Davidson and James G. MacKinnon


3.2 Are OLS Parameter Estimators Unbiased? 


One of the statistical properties that we would like any estimator to have
is that it should be unbiased. Suppose that $\hat\theta$ is an estimator of
some parameter $\theta$, the true value of which is $\theta_0$. Then the bias
of $\hat\theta$ is defined as $E(\hat\theta) - \theta_0$, the expectation of
$\hat\theta$ minus the true value of $\theta$. If the bias of an estimator is
zero for every admissible value of $\theta_0$, then the estimator is said to
be unbiased. Otherwise, it is said to be biased. Intuitively, if we were to
use an unbiased estimator to calculate estimates for a very large number of
samples, then the average value of those estimates would tend to the quantity
being estimated. If their other statistical properties were the same, we
would always prefer an unbiased estimator to a biased one.


As we have seen, the linear regression model (3.01) can also be written, using 
matrix notation, as 


$$y = X\beta + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I), \tag{3.03}$$


where $y$ and $u$ are $n$-vectors, $X$ is an $n \times k$ matrix, and $\beta$
is a $k$-vector. In (3.03), the notation $\mathrm{IID}(0, \sigma^2 I)$ is
just another way of saying that each element of the vector $u$ is
independently and identically distributed with mean 0 and variance
$\sigma^2$. This notation, which may seem a little strange at this point, is
convenient to use when the model is written in matrix notation. Its meaning
should become clear in Section 3.4. As we first saw in Section 1.5, the OLS
estimator of $\beta$ can be written as

$$\hat\beta = (X^\top X)^{-1}X^\top y. \tag{3.04}$$


In order to see whether this estimator is biased, we need to replace $y$ by
whatever it is equal to under the DGP that is assumed to have generated the
data. Since we wish to assume that the model (3.03) is correctly specified,
we suppose that the DGP is given by (3.03) with $\beta = \beta_0$.
Substituting this into (3.04) yields

$$\begin{aligned}
\hat\beta &= (X^\top X)^{-1}X^\top(X\beta_0 + u) \\
          &= \beta_0 + (X^\top X)^{-1}X^\top u.
\end{aligned} \tag{3.05}$$

The expectation of the second line here is

$$E(\hat\beta) = \beta_0 + E\bigl((X^\top X)^{-1}X^\top u\bigr). \tag{3.06}$$

It is obvious that $\hat\beta$ will be unbiased if and only if the second
term in (3.06) is equal to a zero vector. What is not entirely obvious is
just what assumptions are needed to ensure that this condition will hold.
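
The decomposition in (3.05) is an exact algebraic identity for any single sample, not a statement about averages. As a quick numerical check, the Python sketch below draws one simulated sample and confirms that $\hat\beta$ equals $\beta_0$ plus the term $(X^\top X)^{-1}X^\top u$. The dimensions, parameter values, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3
X = rng.standard_normal((n, k))          # illustrative regressors
beta0 = np.array([1.0, -2.0, 0.5])       # hypothetical true parameters
u = rng.standard_normal(n)               # error terms for this one sample
y = X @ beta0 + u

# OLS estimator (3.04), computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The second term of (3.05): (X'X)^{-1} X'u.
sampling_error = np.linalg.solve(X.T @ X, X.T @ u)

print(beta_hat - (beta0 + sampling_error))  # zero up to rounding error
```

In real data $u$ is not observed, so the second term cannot be computed; the check is possible here only because the sample was simulated.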


Assumptions about Error Terms and Regressors 


In certain cases, it may be reasonable to treat the matrix $X$ as
nonstochastic, or fixed. For example, this would certainly be a reasonable
assumption to make if the data pertained to an experiment, and the
experimenter had chosen the values of all the variables that enter into $X$
before $y$ was determined. In this case, the matrix
$(X^\top X)^{-1}X^\top$ is not random, and the second term in (3.06) becomes

$$E\bigl((X^\top X)^{-1}X^\top u\bigr) = (X^\top X)^{-1}X^\top E(u). \tag{3.07}$$


If $X$ really is fixed, it is perfectly valid to move the expectations
operator through the factor that depends on $X$, as we have done in (3.07).
Then, if we are willing to assume that $E(u) = 0$, we will obtain the result
that the vector on the right-hand side of (3.07) is a zero vector.
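
The fixed-regressor argument can be illustrated by simulation. In the Python sketch below, the same $X$ is reused in every replication and only $u$ is redrawn, so the average of the estimates across replications should be close to $\beta_0$. The design of $X$, the parameter values, the number of replications, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 20000
# X is held fixed across replications, as in an experiment.
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta0 = np.array([1.0, 2.0])                 # hypothetical true parameters
A = np.linalg.solve(X.T @ X, X.T)            # (X'X)^{-1} X', nonrandom here

U = rng.standard_normal((reps, n))           # each row is one draw of u, E(u) = 0
Y = X @ beta0 + U                            # each row is one simulated y
beta_hats = Y @ A.T                          # row r holds beta_hat for draw r
mean_beta_hat = beta_hats.mean(axis=0)

print(mean_beta_hat)                         # close to beta0
```

Any individual row of `beta_hats` can be far from $\beta_0$; unbiasedness is a statement about the mean over repeated samples only.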


Unfortunately, the assumption that $X$ is fixed, convenient though it may be
for showing that $\hat\beta$ is unbiased, is frequently not a reasonable
assumption to make in applied econometric work. More commonly, at least some
of the columns of $X$ correspond to variables that are no less random than
$y$ itself, and it would often stretch credulity to treat them as fixed.
Luckily, we can still show that $\hat\beta$ is unbiased in some quite
reasonable circumstances without making such a strong assumption.


A weaker assumption is that the explanatory variables which form the columns 
of X are exogenous. The concept of exogeneity was introduced in Section 1.3. 
When applied to the matrix X, it implies that any randomness in the DGP 
that generated X is independent of the error terms u in the DGP for y. This 
independence in turn implies that 


E(u|X) =0. (3.08) 


In words, this says that the mean of the entire vector $u$, that is, of every
one of the $u_t$, is zero conditional on the entire matrix $X$. See Section 1.2 for a
discussion of conditional expectations. Although condition (3.08) is weaker 
than the condition of independence of X and u, it is convenient to refer to 
(3.08) as an exogeneity assumption. 


Given the exogeneity assumption (3.08), it is easy to show that $\hat\beta$
is unbiased. It is clear that

$$E\bigl((X^\top X)^{-1}X^\top u \mid X\bigr) = 0, \tag{3.09}$$

because the expectation of $(X^\top X)^{-1}X^\top$ conditional on $X$ is
just itself, and the expectation of $u$ conditional on $X$ is assumed to
be 0; see (1.17). Then, applying the Law of Iterated Expectations, we see
that the unconditional expectation of the left-hand side of (3.09) must be
equal to the expectation of the right-hand side, which is just 0.


Assumption (3.08) is perfectly reasonable in the context of some types of data. 
In particular, suppose that a sample consists of cross-section data, in which 
each observation might correspond to an individual firm, household, person, 
or city. For many cross-section data sets, there may be no reason to believe 
that u is in any way related to the values of the regressors for any of the 
observations. On the other hand, suppose that a sample consists of time- 
series data, in which each observation might correspond to a year, quarter, 
month, or day, as would be the case, for instance, if we wished to estimate a 
consumption function, as in Chapter 1. Even if we are willing to assume that 
$u_t$ is in no way related to current and past values of the regressors, it must
be related to future values if current values of the dependent variable affect 
future values of some of the regressors. Thus, in the context of time-series 
data, the exogeneity assumption (3.08) is a very strong one that we may often 
not feel comfortable in making. 


The assumption that we made in Section 1.3 about the error terms and the
explanatory variables, namely, that

$$E(u_t \mid X_t) = 0, \tag{3.10}$$

is substantially weaker than assumption (3.08), because (3.08) rules out the
possibility that the mean of $u_t$ may depend on the values of the regressors
for any observation, while (3.10) merely rules out the possibility that it
may depend on their values for the current observation. For reasons that will
become apparent in the next subsection, we refer to (3.10) as a
predeterminedness condition. Equivalently, we say that the regressors are
predetermined with respect to the error terms.


The OLS Estimator Can Be Biased 


We have just seen that the OLS estimator $\hat\beta$ is unbiased if we make
assumption (3.08) that the explanatory variables $X$ are exogenous, but we
remarked that this assumption can sometimes be uncomfortably strong. If we
are not prepared to go beyond the predeterminedness assumption (3.10), which
it is rarely sensible to do if we are using time-series data, then we will
find that $\hat\beta$ is, in general, biased.


Many regression models for time-series data include one or more lagged
variables among the regressors. The first lag of a time-series variable that
takes on the value $z_t$ at time $t$ is the variable whose value at $t$ is
$z_{t-1}$. Similarly, the second lag of $z_t$ has value $z_{t-2}$, and the
$p^{\text{th}}$ lag has value $z_{t-p}$. In some models, lags of the
dependent variable itself are used as regressors. Indeed, in some cases, the
only regressors, except perhaps for a constant term and time trend or dummy
variables, are lagged dependent variables. Such models are called
autoregressive, because the conditional mean of the dependent variable
depends on lagged values of the variable itself. A simple example of an
autoregressive model is


$$y = \beta_1\iota + \beta_2 y_1 + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I). \tag{3.11}$$

Here, as usual, $\iota$ is a vector of 1s, the vector $y$ has typical element
$y_t$, the dependent variable, and the vector $y_1$ has typical element
$y_{t-1}$, the lagged dependent variable. This model can also be written, in
terms of a typical observation, as

$$y_t = \beta_1 + \beta_2 y_{t-1} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2).$$


It is perfectly reasonable to assume that the predeterminedness condition
(3.10) holds for the model (3.11), because this condition amounts to saying
that $E(u_t) = 0$ for every possible value of $y_{t-1}$. The lagged dependent
variable $y_{t-1}$ is then said to be predetermined with respect to the error
term $u_t$. Not only is $y_{t-1}$ realized before $u_t$, but its realized
value has no impact on the expectation of $u_t$. However, it is clear that
the exogeneity assumption (3.08), which would here require that
$E(u \mid y_1) = 0$, cannot possibly hold, because $y_{t-1}$ depends on
$u_{t-1}$, $u_{t-2}$, and so on. Assumption (3.08) will evidently fail to
hold for any model in which the regression function includes a lagged
dependent variable.


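To see the consequences of assumption (3.08) not holding, we use the FWL
Theorem to write out $\hat\beta_2$ explicitly as

$$\hat\beta_2 = (y_1^\top M_\iota y_1)^{-1} y_1^\top M_\iota y.$$

Here $M_\iota$ denotes the projection matrix
$I - \iota(\iota^\top\iota)^{-1}\iota^\top$, which centers any vector it
multiplies; recall (2.32). If we replace $y$ by
$\beta_{10}\iota + \beta_{20}y_1 + u$, where $\beta_{10}$ and $\beta_{20}$
are specific values of the parameters, and use the fact that $M_\iota$
annihilates the constant vector, we find that

$$\begin{aligned}
\hat\beta_2 &= (y_1^\top M_\iota y_1)^{-1} y_1^\top M_\iota (y_1\beta_{20} + u) \\
            &= \beta_{20} + (y_1^\top M_\iota y_1)^{-1} y_1^\top M_\iota u.
\end{aligned} \tag{3.12}$$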


This is evidently just a special case of (3.05). 


It is clear that $\hat\beta_2$ will be unbiased if and only if the second
term in the second line of (3.12) has expectation zero. But this term does
not have expectation zero. Because $y_1$ is stochastic, we cannot simply move
the expectations operator, as we did in (3.07), and then take the
unconditional expectation of $u$. Because $E(u \mid y_1) \neq 0$, we also
cannot take expectations conditional on $y_1$, in the way that we took
expectations conditional on $X$ in (3.09), and then rely on the Law of
Iterated Expectations. In fact, as readers are asked to demonstrate in
Exercise 3.1, the estimator $\hat\beta_2$ is biased.


It seems reasonable that, if $\hat\beta_2$ is biased, so must be
$\hat\beta_1$. The equivalent of the second line of (3.12) is

$$\hat\beta_1 = \beta_{10} + (\iota^\top M_{y_1}\iota)^{-1}\iota^\top M_{y_1}u, \tag{3.13}$$

where the notation should be self-explanatory. Once again, because $y_1$
depends on $u$, we cannot employ the methods that we used in (3.07) or (3.09)
to prove that the second term on the right-hand side of (3.13) has mean zero.
In fact, it does not have mean zero, and $\hat\beta_1$ is consequently
biased, as readers are also asked to demonstrate in Exercise 3.1.


The problems we have just encountered when dealing with the autoregressive 
model (3.11) will evidently affect every regression model with random regres- 
sors for which the exogeneity assumption (3.08) does not hold. Thus, for all 
such models, the least squares estimator of the parameters of the regression 
function is biased. Assumption (3.08) cannot possibly hold when the regressor 
matrix X contains lagged dependent variables, and it probably fails to hold 
for most other models that involve time-series data. 
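
A small Monte Carlo experiment makes the bias in the autoregressive model (3.11) visible. The Python sketch below is only an illustration under assumed settings: $\beta_{10} = 1$, $\beta_{20} = 0.5$, standard normal errors, a sample of 25 observations, a starting value $y_0$ set to $\beta_{10}/(1-\beta_{20})$, and 5000 replications. None of these values comes from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 5000
beta10, beta20 = 1.0, 0.5          # hypothetical true parameter values

def ols_ar1(y_full):
    """OLS of y_t on a constant and y_{t-1}; y_full holds y_0, ..., y_n."""
    y, y_lag = y_full[1:], y_full[:-1]
    X = np.column_stack([np.ones(len(y)), y_lag])
    return np.linalg.lstsq(X, y, rcond=None)[0]

estimates = np.empty((reps, 2))
for r in range(reps):
    u = rng.standard_normal(n)
    y_full = np.empty(n + 1)
    y_full[0] = beta10 / (1.0 - beta20)      # assumed starting value y_0
    for t in range(1, n + 1):
        y_full[t] = beta10 + beta20 * y_full[t - 1] + u[t - 1]
    estimates[r] = ols_ar1(y_full)

mean_b1, mean_b2 = estimates.mean(axis=0)
print(mean_b1, mean_b2)   # mean_b2 falls noticeably below beta20 = 0.5
```

With settings like these, the average of the $\hat\beta_2$ estimates typically lies well below $\beta_{20}$, even though the predeterminedness condition (3.10) holds by construction.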


3.3 Are OLS Parameter Estimators Consistent? 


Unbiasedness is by no means the only desirable property that we would like 
an estimator to possess. Another very important property is consistency. A 
consistent estimator is one for which the estimate tends to the quantity being 
estimated as the size of the sample tends to infinity. Thus, if the sample size 
is large enough, we can be confident that the estimate will be close to the true 
value. Happily, the least squares estimator $\hat\beta$ will often be consistent even
when it is biased. 


In order to define consistency, we have to specify what it means for the
sample size $n$ to tend to infinity or, in more compact notation,
$n \to \infty$. At first sight, this may seem like a very odd notion. After
all, any given data set contains a fixed number of observations.
Nevertheless, we can certainly imagine simulating data and letting $n$ become
arbitrarily large. In the case of a pure time-series model like (3.11), we
can easily generate any sample size we want, just by letting the simulations
run on for long enough. In the case of a model with cross-section data, we
can pretend that the original sample is taken from a population of infinite
size, and we can imagine drawing more and more observations from that
population. Even in the case of a model with fixed regressors, we can think
of ways to make $n$ tend to infinity. Suppose that the original $X$ matrix is
of dimension $m \times k$. Then we can create $X$ matrices of dimensions
$2m \times k$, $3m \times k$, $4m \times k$, and so on, simply by stacking as
many copies of the original $X$ matrix as we like. By simulating error
vectors of the appropriate length, we can then generate $y$ vectors of any
length $n$ that is an integer multiple of $m$. Thus, in all these cases, we
can reasonably think of letting $n$ tend to infinity.


Probability Limits


In order to say what happens to a stochastic quantity that depends on $n$
as $n \to \infty$, we need to introduce the concept of a probability limit.
The probability limit, or plim for short, generalizes the ordinary concept of
a limit to quantities that are stochastic. If $a(y^n)$ is some vector
function of the random vector $y^n$, and the plim of $a(y^n)$ as
$n \to \infty$ is $a_0$, we may write

$$\operatorname*{plim}_{n\to\infty} a(y^n) = a_0. \tag{3.14}$$

We have written $y^n$ here, instead of just $y$, to emphasize the fact that
$y^n$ is a vector of length $n$, and that $n$ is not fixed. The superscript
is often omitted in practice. In econometrics, we are almost always
interested in taking probability limits as $n \to \infty$. Thus, when there
can be no ambiguity, we will often simply use notation like
$\operatorname{plim} a(y)$ rather than more precise notation like that of
(3.14).


Formally, the random vector $a(y^n)$ tends in probability to the limiting
random vector $a_0$ if, for all $\varepsilon > 0$,

$$\lim_{n\to\infty} \Pr\bigl(\|a(y^n) - a_0\| < \varepsilon\bigr) = 1. \tag{3.15}$$

Here $\|\cdot\|$ denotes the Euclidean norm of a vector (see Section 2.2),
which simplifies to the absolute value when its argument is a scalar.
Condition (3.15) says that, for any specified tolerance level $\varepsilon$,
no matter how small, the probability that the norm of the discrepancy between
$a(y^n)$ and $a_0$ will be less than $\varepsilon$ goes to unity as
$n \to \infty$.


Although the probability limit $a_0$ was defined above to be a random
variable (actually, a vector of random variables), it may in fact be an
ordinary nonrandom vector or scalar, in which case it is said to be
nonstochastic. Many of the plims that we will encounter in this book are in
fact nonstochastic. A simple example of a nonstochastic plim is the limit of
the proportion of heads in a series of independent tosses of an unbiased
coin. Suppose that $y_t$ is a random variable equal to 1 if the coin comes up
heads, and equal to 0 if it comes up tails. After $n$ tosses, the proportion
of heads is just

$$p(y^n) \equiv \frac{1}{n}\sum_{t=1}^{n} y_t.$$

If the coin really is unbiased, $E(y_t) = 1/2$. Thus it should come as no
surprise to learn that $\operatorname{plim} p(y^n) = 1/2$. Proving this
requires a certain amount of effort, however, and we will therefore not
attempt a proof here. For a detailed discussion and proof, see Davidson and
MacKinnon (1993, Section 4.2).
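
The coin-tossing example is easy to simulate. The Python sketch below draws one long sequence of simulated tosses and reports the proportion of heads for several values of $n$; the seed, the sample sizes, and the helper name `proportion_heads` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# y_t = 1 for heads and 0 for tails, each with probability 1/2.
tosses = rng.integers(0, 2, size=1_000_000)

def proportion_heads(n):
    """Proportion of heads in the first n tosses."""
    return tosses[:n].mean()

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, proportion_heads(n))   # settles near 0.5 as n grows
```

A simulation of this kind illustrates the result but does not prove it, since it follows only one realized sequence of tosses.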


The coin-tossing example is really a special case of an extremely powerful
result in probability theory, which is called a law of large numbers, or LLN.
Suppose that $\bar x$ is the sample mean of $x_t$, $t = 1,\ldots,n$, a
sequence of random variables, each with expectation $\mu$. Then, provided the
$x_t$ are independent (or at least, not too dependent), a law of large
numbers would state that

$$\operatorname*{plim}_{n\to\infty} \bar x = \operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} x_t = \mu. \tag{3.16}$$

In words, $\bar x$ has a nonstochastic plim which is equal to the common
expectation of each of the $x_t$.


It is not hard to see intuitively why (3.16) is true under certain
conditions. Suppose, for example, that the $x_t$ are IID, with variance
$\sigma^2$. Then we see at once that

$$E(\bar x) = \frac{1}{n}\sum_{t=1}^{n} E(x_t) = \frac{1}{n}\sum_{t=1}^{n} \mu = \mu, \quad\text{and}$$

$$\mathrm{Var}(\bar x) = \Bigl(\frac{1}{n}\Bigr)^{2}\sum_{t=1}^{n} \sigma^2 = \frac{1}{n}\sigma^2.$$

Thus $\bar x$ has mean $\mu$ and a variance which tends to zero as
$n \to \infty$. In the limit, we expect that, on account of the shrinking
variance, $\bar x$ will become a nonstochastic quantity equal to its
expectation $\mu$. The law of large numbers assures us that this is the case.
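
The shrinking variance of the sample mean can be checked by simulation. The following Python sketch compares the variance of $\bar x$ across many simulated samples with $\sigma^2/n$. The normal distribution, the values of $\mu$ and $\sigma$, the number of replications, and the seed are all assumptions for illustration; the argument in the text requires only that the $x_t$ be IID with finite variance.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, reps = 3.0, 2.0, 4000     # hypothetical mean, std. dev., replications

def var_of_sample_mean(n):
    """Variance, across replications, of the mean of n IID draws."""
    draws = rng.normal(mu, sigma, size=(reps, n))
    return draws.mean(axis=1).var()

empirical = {n: var_of_sample_mean(n) for n in (10, 100, 1000)}
for n, v in empirical.items():
    print(n, v, sigma**2 / n)        # simulated variance next to sigma^2 / n
```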


Another useful way to think about laws of large numbers is to note that, as
$n \to \infty$, we are collecting more and more information about the mean of
the $x_t$, with each individual observation providing a smaller and smaller
fraction of that information. Thus, eventually, the randomness in the
individual $x_t$ cancels out, and the sample mean $\bar x$ converges to the
population mean $\mu$. For this to happen, we need to make some assumption in
order to prevent any one of the $x_t$ from having too much impact on
$\bar x$. The assumption that they are IID is sufficient for this.
Alternatively, if they are not IID, we could assume that the variance of each
$x_t$ is greater than some finite nonzero lower bound, but smaller than some
finite upper bound. We also need to assume that there is not too much
dependence among the $x_t$ in order to ensure that the random components of
the individual $x_t$ really do cancel out.


There are actually many laws of large numbers, which differ principally in the 
conditions that they impose on the random variables which are being averaged. 
We will not attempt to prove any of these LLNs. Section 4.5 of Davidson and 
MacKinnon (1993) provides a simple proof of a relatively elementary law of 
large numbers. More advanced LLNs are discussed in Section 4.7 of that book, 
and, in more detail, in Davidson (1994). 


Probability limits have some very convenient properties. For example, suppose
that $\{x^n\}$, $n = 1,\ldots,\infty$, is a sequence of random variables
which has a nonstochastic plim $x_0$ as $n \to \infty$, and $\eta(x^n)$ is a
smooth function of $x^n$. Then
$\operatorname{plim}\eta(x^n) = \eta(x_0)$. This feature of plims is one that
is emphatically not shared by expectations. When $\eta(\cdot)$ is a nonlinear
function, $E(\eta(x)) \neq \eta(E(x))$. Thus, it is often very easy to
calculate plims in circumstances where it would be difficult or impossible to
calculate expectations.
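
The contrast between plims and expectations can be seen in a small simulation. The Python sketch below takes $\eta(x) = x^2$ and lets $x^n$ be the mean of $n$ IID draws with expectation $\mu = 2$ and variance 1, so that $\eta(x_0) = 4$ while $E(\eta(x^n)) = \mu^2 + 1/n$. The choice of $\eta$, the normal distribution, the sample sizes, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 2.0
eta = np.square                     # a nonlinear smooth function, eta(x) = x^2

# plim: for one very large sample, eta(sample mean) is close to eta(mu) = 4.
xbar_large = rng.normal(mu, 1.0, size=1_000_000).mean()
print(eta(xbar_large))

# Expectation: for small n, E(eta(xbar)) = mu^2 + 1/n differs from eta(E(xbar)).
n_small, reps = 5, 200_000
xbars = rng.normal(mu, 1.0, size=(reps, n_small)).mean(axis=1)
mean_eta = eta(xbars).mean()        # simulation estimate of E(eta(xbar))
eta_mean = eta(xbars.mean())        # simulation estimate of eta(E(xbar))
print(mean_eta, eta_mean)           # roughly 4.2 versus roughly 4.0
```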


However, working with plims can be a little bit tricky. The problem is that
many of the stochastic quantities we encounter in econometrics do not have
probability limits unless we divide them by $n$ or, perhaps, by some power
of $n$. For example, consider the matrix $X^\top X$, which appears in the
formula (3.04) for $\hat\beta$. Each element of this matrix is a scalar
product of two of the columns of $X$, that is, two $n$-vectors. Thus it is a
sum of $n$ numbers. As $n \to \infty$, we would expect that, in most
circumstances, such a sum would tend to infinity as well. Therefore, the
matrix $X^\top X$ will generally not have a plim. However, it is not at all
unreasonable to assume that

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}X^\top X = S_{X^\top X}, \tag{3.17}$$


where $S_{X^\top X}$ is a nonstochastic matrix with full rank $k$, since each
element of the matrix on the left-hand side of (3.17) is now an average of
$n$ numbers:

$$\Bigl(\frac{1}{n}X^\top X\Bigr)_{ij} = \frac{1}{n}\sum_{t=1}^{n} X_{ti}X_{tj}.$$

In effect, when we write (3.17), we are implicitly making some assumption
sufficient for a LLN to hold for the sequences generated by the squares of
the regressors and their cross-products. Thus there should not be too much
dependence between $X_{ti}X_{tj}$ and $X_{si}X_{sj}$ for $s \neq t$, and the
variances of these quantities should not differ too much as $t$ and $s$ vary.
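
A short simulation shows why the division by $n$ matters. In the Python sketch below, the regressors are a constant and an IID normal variable with mean 1 and standard deviation 2 (assumed values), so that $n^{-1}X^\top X$ should settle near the matrix with rows $(1, 1)$ and $(1, 5)$ while $X^\top X$ itself keeps growing. The seed and sample sizes are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=100_000)       # E(x) = 1, E(x^2) = 1 + 4 = 5

def moment_matrices(n):
    """Return X'X and (1/n) X'X using the first n observations."""
    X = np.column_stack([np.ones(n), x[:n]])
    XtX = X.T @ X
    return XtX, XtX / n

for n in (100, 10_000, 100_000):
    XtX, S_hat = moment_matrices(n)
    print(n, XtX[1, 1], S_hat[1, 1])         # first grows with n, second stays near 5

S_limit = np.array([[1.0, 1.0], [1.0, 5.0]]) # limit implied by the assumed design
```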


The OLS Estimator is Consistent 


We can now show that, under plausible assumptions, the least squares
estimator $\hat\beta$ is consistent. When the DGP is a special case of the
regression model (3.03) that is being estimated, we saw in (3.05) that

$$\hat\beta = \beta_0 + (X^\top X)^{-1}X^\top u. \tag{3.18}$$


To demonstrate that $\hat\beta$ is consistent, we need to show that the
second term on the right-hand side here has a plim of zero. This term is the
product of two matrix expressions, $(X^\top X)^{-1}$ and $X^\top u$. Neither
$X^\top X$ nor $X^\top u$ has a probability limit. However, we can divide
both of these expressions by $n$ without changing the value of this term,
since $n \cdot n^{-1} = 1$. By doing so, we convert them into quantities
that, under reasonable assumptions, will have nonstochastic plims. Thus the
plim of the second term in (3.18) becomes


$$\Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n}X^\top X\Bigr)^{-1} \operatorname*{plim}_{n\to\infty} \frac{1}{n}X^\top u = (S_{X^\top X})^{-1} \operatorname*{plim}_{n\to\infty} \frac{1}{n}X^\top u = 0. \tag{3.19}$$

In writing the first equality here, we have assumed that (3.17) holds. To
obtain the second equality, we start with assumption (3.10), which can
reasonably be made even when there are lagged dependent variables among the
regressors. This assumption tells us that $E(X_t^\top u_t \mid X_t) = 0$, and
the Law of Iterated Expectations then tells us that $E(X_t^\top u_t) = 0$.
Thus, assuming that we can apply a law of large numbers,

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}X^\top u = \operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} X_t^\top u_t = 0.$$

Together with (3.18), (3.19) gives us the result that $\hat\beta$ is
consistent.
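
Consistency can be illustrated with the autoregressive model (3.11), where OLS is biased in small samples. The Python sketch below simulates one long series and re-estimates the model on longer and longer initial stretches of it. The parameter values $\beta_{10} = 1$ and $\beta_{20} = 0.5$, the standard normal errors, the starting value, the sample sizes, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
beta10, beta20, n_max = 1.0, 0.5, 100_000    # hypothetical DGP parameters

# Simulate y_0, ..., y_{n_max} from y_t = beta10 + beta20 * y_{t-1} + u_t.
u = rng.standard_normal(n_max)
y = np.empty(n_max + 1)
y[0] = beta10 / (1.0 - beta20)               # assumed starting value
for t in range(1, n_max + 1):
    y[t] = beta10 + beta20 * y[t - 1] + u[t - 1]

def ols_ar1(n):
    """OLS of y_t on a constant and y_{t-1}, using observations t = 1, ..., n."""
    X = np.column_stack([np.ones(n), y[:n]])
    return np.linalg.lstsq(X, y[1:n + 1], rcond=None)[0]

results = {n: ols_ar1(n) for n in (25, 400, 6400, 100_000)}
for n, b in results.items():
    print(n, b)                              # estimates approach (1.0, 0.5) as n grows
```

This follows a single simulated history, so the small-sample estimates it prints will vary with the seed; only the behaviour for large $n$ reflects the consistency result.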


We have just seen that the OLS estimator $\hat\beta$ is consistent under
considerably weaker assumptions about the relationship between the error
terms and the regressors than were needed to prove that it is unbiased;
compare (3.10) and (3.08). This may wrongly suggest that consistency is a
weaker condition than unbiasedness. Actually, it is neither weaker nor
stronger. Consistency and unbiasedness are simply different concepts.
Sometimes, least squares estimators may be biased but consistent, for
example, in models where $X$ includes lagged dependent variables. In other
circumstances, however, these estimators may be unbiased but not consistent.
For example, consider the model

$$y_t = \beta_1 + \beta_2\frac{1}{t} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2). \tag{3.20}$$


Since both regressors here are nonstochastic, the least squares estimates
$\hat\beta_1$ and $\hat\beta_2$ are clearly unbiased. However, it is easy to
see that $\hat\beta_2$ is not consistent. The problem is that, as
$n \to \infty$, each observation provides less and less information about
$\beta_2$. This happens because the regressor $1/t$ tends to zero, and hence
varies less and less across observations as $t$ becomes larger. As a
consequence, the matrix $S_{X^\top X}$ can be shown to be singular.
Therefore, equation (3.19) does not hold, and the second term on the
right-hand side of equation (3.18) does not have a probability limit of zero.


The model (3.20) is actually rather a curious one, since $\hat\beta_1$ is
consistent even though $\hat\beta_2$ is not. The reason $\hat\beta_1$ is
consistent is that, as the sample size $n$ gets larger, we obtain an amount
of information about $\beta_1$ that is roughly proportional to $n$. In
contrast, because each successive observation gives us less and less
information about $\beta_2$, $\hat\beta_2$ is not consistent.
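
Because the regressors in (3.20) are nonstochastic, the singularity can be checked by direct computation, with no simulation. The Python sketch below evaluates $n^{-1}X^\top X$ for increasing $n$ and also the centered sum of squares of the regressor $1/t$, which, by the FWL Theorem, is the quantity inverted in the formula for $\hat\beta_2$. The function names and sample sizes are assumptions for illustration.

```python
import numpy as np

def scaled_moment_matrix(n):
    """(1/n) X'X for X = [constant, 1/t], t = 1, ..., n."""
    t = np.arange(1, n + 1)
    X = np.column_stack([np.ones(n), 1.0 / t])
    return X.T @ X / n

def centered_sum_of_squares(n):
    """x' M_iota x for x_t = 1/t: the sum of squared deviations from the mean."""
    x = 1.0 / np.arange(1, n + 1)
    return np.sum((x - x.mean()) ** 2)

for n in (10, 1_000, 1_000_000):
    S_n = scaled_moment_matrix(n)
    print(n, np.linalg.det(S_n), centered_sum_of_squares(n))

S_approx = scaled_moment_matrix(1_000_000)   # close to [[1, 0], [0, 0]]
info_beta2 = centered_sum_of_squares(1_000_000)
```

The determinant of $n^{-1}X^\top X$ shrinks toward zero, while the centered sum of squares levels off near $\pi^2/6$ instead of growing with $n$. That bounded quantity is one way to see why the sampling error of $\hat\beta_2$ does not die out.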


An estimator that is not consistent is said to be inconsistent. There are
two types of inconsistency, which are actually quite different. If an
unbiased estimator, like $\hat\beta_2$ in the previous example, is
inconsistent, it is so because it does not tend to any nonstochastic
probability limit. In contrast, many inconsistent estimators do tend to
nonstochastic probability limits, but they tend to the wrong ones.


To illustrate the various types of inconsistency, and the relationship
between bias and inconsistency, imagine that we are trying to estimate the
population mean, $\mu$, from a sample of data $y_t$, $t = 1,\ldots,n$. A
sensible estimator would be the sample mean, $\bar y$. Under reasonable
assumptions about the way the $y_t$ are generated, $\bar y$ will be unbiased
and consistent. Three not very sensible estimators are the following:

$$\hat\mu_1 = \frac{1}{n+1}\sum_{t=1}^{n} y_t,$$

$$\hat\mu_2 = \frac{1.01}{n}\sum_{t=1}^{n} y_t, \quad\text{and}$$

$$\hat\mu_3 = 0.01\,y_1 + \frac{0.99}{n-1}\sum_{t=2}^{n} y_t.$$

The first of these estimators, $\hat\mu_1$, is biased but consistent. It is evidently equal
to $n/(n+1)$ times $\bar y$. Thus its mean is $(n/(n+1))\mu$, which tends to $\mu$ as
$n \to \infty$, and it will be consistent whenever $\bar y$ is. The second estimator, $\hat\mu_2$, is
clearly biased and inconsistent. Its mean is $1.01\mu$, since it is equal to $1.01\bar y$,
and it will actually tend to a plim of $1.01\mu$ as $n \to \infty$. The third estimator, $\hat\mu_3$,
is perhaps the most interesting. It is clearly unbiased, since it is a weighted
average of two estimators, $y_1$ and the average of $y_2$ through $y_n$, each of which
is unbiased. The second of these two estimators is also consistent. However,
$\hat\mu_3$ itself is not consistent, because it does not converge to a nonstochastic
plim. Instead, it converges to the random quantity $0.99\mu + 0.01 y_1$.
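

The behavior of the three estimators can be illustrated with a small Monte Carlo sketch. The function names, the normal distribution of the $y_t$, and the default values of mu, reps, and seed are assumptions made only for this illustration.

```python
import numpy as np

def mu_hat_1(y):
    """Sum of the observations divided by n + 1: biased but consistent."""
    return y.sum() / (len(y) + 1)

def mu_hat_2(y):
    """1.01 times the sample mean: biased and inconsistent."""
    return 1.01 * y.mean()

def mu_hat_3(y):
    """0.01*y_1 plus 0.99 times the mean of y_2..y_n: unbiased but inconsistent."""
    return 0.01 * y[0] + 0.99 * y[1:].mean()

def simulate(n, mu=2.0, reps=2000, seed=0):
    """Monte Carlo mean and variance of each estimator (normal data assumed)."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(loc=mu, scale=1.0, size=(reps, n))
    est = np.array([[f(y) for f in (mu_hat_1, mu_hat_2, mu_hat_3)] for y in draws])
    return est.mean(axis=0), est.var(axis=0)

for n in (10, 100, 1000):
    means, variances = simulate(n)
    print(n, means.round(3), variances.round(5))
```

As n grows, the simulated variance of the third estimator should level off near 0.0001 times the variance of $y_1$ instead of shrinking to zero, which is the sense in which it fails to converge to a nonstochastic limit.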


3.4 The Covariance Matrix of the OLS Parameter Estimates 


Although it is valuable to know that the least squares estimator $\hat\beta$ is either
unbiased or, under weaker conditions, consistent, this information by itself is
not very useful. If we are to interpret any given set of OLS parameter estimates,
we need to know, at least approximately, how $\hat\beta$ is actually distributed.
For purposes of inference, the most important feature of the distribution of 
any vector of parameter estimates is the matrix of its central second moments. 
This matrix is the analog, for vector random variables, of the variance of a 
scalar random variable. If b is any random vector, we will denote its matrix 
of central second moments by Var(b), using the same notation that we would 
use for a variance in the scalar case. Usage, perhaps somewhat illogically, 
dictates that this matrix should be called the covariance matrix, although 
the terms variance matrix and variance-covariance matrix are also sometimes 
used. Whatever it is called, the covariance matrix is an extremely important 
concept which comes up over and over again in econometrics. 


The covariance matrix Var(b) of a random k-vector b, with typical element $b_i$,
organizes all the central second moments of the $b_i$ into a $k \times k$ symmetric
matrix. The $i$th diagonal element of Var(b) is $\mathrm{Var}(b_i)$, the variance of $b_i$. The
$ij$th off-diagonal element of Var(b) is $\mathrm{Cov}(b_i, b_j)$, the covariance of $b_i$ and $b_j$.
The concept of covariance was introduced in Exercise 1.10. In terms of the
random variables $b_i$ and $b_j$, the definition is


$$\mathrm{Cov}(b_i, b_j) = E\bigl((b_i - E(b_i))(b_j - E(b_j))\bigr). \tag{3.21}$$


Many of the properties of covariance matrices follow immediately from (3.21).
For example, it is easy to see that, if $i = j$, $\mathrm{Cov}(b_i, b_j) = \mathrm{Var}(b_i)$. Moreover,
since from (3.21) it is obvious that $\mathrm{Cov}(b_i, b_j) = \mathrm{Cov}(b_j, b_i)$, Var(b) must be a
symmetric matrix. The full covariance matrix Var(b) can be expressed readily
using matrix notation. It is just


$$\mathrm{Var}(b) = E\bigl((b - E(b))(b - E(b))^\top\bigr), \tag{3.22}$$


as is obvious from (3.21). An important special case of (3.22) arises when 
$E(b) = 0$. In this case, $\mathrm{Var}(b) = E(bb^\top)$.


The special case in which Var(b) is diagonal, so that all the covariances
are zero, is of particular interest. If $b_i$ and $b_j$ are statistically independent,
$\mathrm{Cov}(b_i, b_j) = 0$; see Exercise 1.11. The converse is not true, however. It is
perfectly possible for two random variables that are not statistically independent
to have covariance 0; for an extreme example of this, see Exercise 1.12.


The correlation between $b_i$ and $b_j$ is


$$\rho(b_i, b_j) = \frac{\mathrm{Cov}(b_i, b_j)}{\bigl(\mathrm{Var}(b_i)\,\mathrm{Var}(b_j)\bigr)^{1/2}}. \tag{3.23}$$


It is often useful to think in terms of correlations rather than covariances, 
because, according to the result of Exercise 3.6, the former always lie between 
$-1$ and 1. We can arrange the correlations between all the elements of b
into a symmetric matrix called the correlation matrix. It is clear from (3.23) 
that all the elements on the principal diagonal of this matrix will be 1. This 
demonstrates that the correlation of any random variable with itself equals 1. 


In addition to being symmetric, Var (b) must be a positive semidefinite matrix; 
see Exercise 3.5. In most cases, covariance matrices and correlation matrices 
are positive definite rather than positive semidefinite, and their properties 
depend crucially on this fact. 
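

As a rough illustration of (3.23), the sketch below converts a covariance matrix into the corresponding correlation matrix. The helper name correlation_matrix and the numerical matrix V are hypothetical and chosen only for the example.

```python
import numpy as np

def correlation_matrix(cov):
    """Divide each covariance by the product of the two standard deviations, as in (3.23)."""
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))          # standard deviations of the elements of b
    return cov / np.outer(sd, sd)

# Hypothetical covariance matrix of a random 3-vector.
V = np.array([[4.0, 1.2, -0.8],
              [1.2, 1.0, 0.3],
              [-0.8, 0.3, 2.25]])
R = correlation_matrix(V)
print(R)                                # unit diagonal, off-diagonal entries in [-1, 1]
print(np.linalg.eigvalsh(V))            # all eigenvalues positive: V is positive definite
```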


Positive Definite Matrices 


A $k \times k$ symmetric matrix A is said to be positive definite if, for all nonzero
k-vectors x, the matrix product $x^\top A x$, which is just a scalar, is positive. The
quantity $x^\top A x$ is called a quadratic form. A quadratic form always involves
a k-vector, in this case x, and a $k \times k$ matrix, in this case A. By the rules of
matrix multiplication,


$$x^\top A x = \sum_{i=1}^{k} \sum_{j=1}^{k} x_i x_j A_{ij}. \tag{3.24}$$


If this quadratic form can take on zero values but not negative values, the 
matrix A is said to be positive semidefinite. 


Any matrix of the form $B^\top B$ is positive semidefinite. To see this, observe
that $B^\top B$ is symmetric and that, for any nonzero x,


$$x^\top B^\top B x = (Bx)^\top (Bx) = \|Bx\|^2 \ge 0. \tag{3.25}$$


This result can hold with equality only if $Bx = 0$. But, in that case, since
$x \ne 0$, the columns of B are linearly dependent. We express this circumstance
by saying that B does not have full column rank. Note that B can have full
rank but not full column rank if B has fewer rows than columns, in which case
the maximum possible rank equals the number of rows. However, a matrix
with full column rank necessarily also has full rank. When B does have full
column rank, it follows from (3.25) that $B^\top B$ is positive definite. Similarly, if
A is positive definite, then any matrix of the form $B^\top A B$ is positive definite
if B has full column rank and positive semidefinite otherwise.
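

The role of full column rank can be checked numerically. In the sketch below, the helper name and the two small matrices are hypothetical: the smallest eigenvalue of $B^\top B$ is strictly positive when the columns of B are linearly independent and is zero (up to rounding) when they are not.

```python
import numpy as np

def min_eigenvalue_of_gram(B):
    """Smallest eigenvalue of B'B: nonnegative always, positive with full column rank."""
    return np.linalg.eigvalsh(B.T @ B).min()

# Columns linearly independent: B'B is positive definite.
B_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])

# Second column is twice the first: B'B is only positive semidefinite.
B_deficient = np.array([[1.0, 2.0],
                        [1.0, 2.0],
                        [1.0, 2.0]])

print(np.linalg.matrix_rank(B_full), min_eigenvalue_of_gram(B_full))
print(np.linalg.matrix_rank(B_deficient), min_eigenvalue_of_gram(B_deficient))
```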


It is easy to see that the diagonal elements of a positive definite matrix must all
be positive. Suppose this were not the case and that, say, $A_{22}$ were negative.
Then, if we chose x to be the vector $e_2$, that is, a vector with 1 as its second
element and all other elements equal to 0 (see Section 2.6), we could make
$x^\top A x < 0$. From (3.24), the quadratic form would just be $e_2^\top A e_2 = A_{22} < 0$.
For a positive semidefinite matrix, the diagonal elements may be 0. Unlike
the diagonal elements, the off-diagonal elements of A may be of either sign.


A particularly simple example of a positive definite matrix is the identity
matrix, I. Because all the off-diagonal elements are zero, (3.24) tells us that
a quadratic form in I is


$$x^\top I x = \sum_{i=1}^{k} x_i^2,$$


which is certainly positive for all nonzero vectors x. The identity matrix was
used in (3.03) in a notation that may not have been clear at the time. There
we specified that $u \sim \mathrm{IID}(0, \sigma^2 I)$. This is just a compact way of saying that
the vector of error terms u is assumed to have mean vector 0 and covariance
matrix $\sigma^2 I$.


A positive definite matrix cannot be singular, because, if A is singular, there
must exist a nonzero x such that $Ax = 0$. But then $x^\top A x = 0$ as well, which
means that A is not positive definite. Thus the inverse of a positive definite
matrix always exists. It too is a positive definite matrix, as readers are asked
to show in Exercise 3.7.


There is a sort of converse of the result that any matrix of the form $B^\top B$,
where B has full column rank, is positive definite. It is that, if A is a symmetric
positive definite $k \times k$ matrix, there always exist full-rank $k \times k$ matrices B
such that $A = B^\top B$. For any given A, such a B is not unique. In particular,
B can be chosen to be symmetric, but it can also be chosen to be upper or
lower triangular. Details of a simple algorithm (Crout’s algorithm) for finding
a triangular B can be found in Press et al. (1992a, 1992b).
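

The non-uniqueness of B can be seen in a short sketch. It does not implement Crout's algorithm; it relies instead on NumPy's built-in Cholesky and symmetric eigenvalue routines, and the matrix A is a hypothetical example.

```python
import numpy as np

# Hypothetical symmetric positive definite matrix.
A = np.array([[4.0, 2.0, 0.4],
              [2.0, 3.0, 0.5],
              [0.4, 0.5, 1.0]])

# Triangular choice: np.linalg.cholesky returns lower-triangular L with A = L L'.
L = np.linalg.cholesky(A)
B_upper = L.T                       # upper triangular, so A = B_upper' B_upper

# Symmetric choice: the symmetric square root built from the eigendecomposition.
eigval, eigvec = np.linalg.eigh(A)
B_sym = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

print(np.allclose(B_upper.T @ B_upper, A))   # both factorizations reproduce A
print(np.allclose(B_sym.T @ B_sym, A))
```

The two matrices B_upper and B_sym differ, yet each satisfies $A = B^\top B$.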


The OLS Covariance Matrix 


The notation we used in the specification (3.03) of the linear regression model 
can now be understood in terms of the covariance matrix of the error terms, 
or the error covariance matrix. If the error terms are IID, they all have the
same variance $\sigma^2$, and the covariance of any pair of them is zero. Thus the
covariance matrix of the vector u is $\sigma^2 I$, and we have


$$\mathrm{Var}(u) = E(uu^\top) = \sigma^2 I. \tag{3.26}$$


Notice that this result does not require the error terms to be independent. It 
is required only that they all have the same variance and that the covariance 
of each pair of error terms is zero. 


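If we assume that X is exogenous, we can now calculate the covariance matrix
of $\hat\beta$ in terms of the error covariance matrix (3.26). To do this, we need to
multiply the vector $\hat\beta - \beta_0$ by itself transposed. From (3.05), we know that


$$\hat\beta - \beta_0 = (X^\top X)^{-1} X^\top u.$$


By (3.22), under the assumption that $\hat\beta$ is unbiased, $\mathrm{Var}(\hat\beta)$ is the expectation
of the $k \times k$ matrix


$$(\hat\beta - \beta_0)(\hat\beta - \beta_0)^\top = (X^\top X)^{-1} X^\top u u^\top X (X^\top X)^{-1}. \tag{3.27}$$


Taking this expectation, conditional on X, and using (3.26) with the specific
value $\sigma_0^2$ for the covariance matrix of the error terms, yields


$$\begin{aligned}
(X^\top X)^{-1} X^\top E(uu^\top) X (X^\top X)^{-1} &= (X^\top X)^{-1} X^\top \sigma_0^2 I X (X^\top X)^{-1} \\
&= \sigma_0^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma_0^2 (X^\top X)^{-1}.
\end{aligned}$$


Thus we conclude that


$$\mathrm{Var}(\hat\beta) = \sigma_0^2 (X^\top X)^{-1}. \tag{3.28}$$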


This is the standard result for the covariance matrix of $\hat\beta$ under the assumption
that the data are generated by (3.01) and that $\hat\beta$ is an unbiased estimator.


Precision of the Least Squares Estimates


Now that we have an expression for $\mathrm{Var}(\hat\beta)$, we can investigate what
determines the precision of the least squares coefficient estimates $\hat\beta$. There are
really only three things that matter. The first of these is $\sigma_0^2$, the true variance
of the error terms. Not surprisingly, $\mathrm{Var}(\hat\beta)$ is proportional to $\sigma_0^2$. The more
random variation there is in the error terms, the more random variation there
is in the parameter estimates.


The second thing that affects the precision of $\hat\beta$ is the sample size, n. It is
illuminating to rewrite (3.28) as


$$\mathrm{Var}(\hat\beta) = \Bigl(\frac{1}{n}\sigma_0^2\Bigr)\Bigl(\frac{1}{n}X^\top X\Bigr)^{-1}. \tag{3.29}$$


If we make the assumption (3.17), the second factor on the right-hand side of
(3.29) will not vary much with the sample size n, at least not if n is reasonably
large. In that case, the right-hand side of (3.29) will be roughly proportional
to $1/n$, because the first factor is precisely proportional to $1/n$. Thus, if we
were to double the sample size, we would expect the variance of $\hat\beta$ to be
roughly halved and the standard errors of the individual $\hat\beta_i$ to be divided
by $\sqrt{2}$.
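

The effect of doubling the sample size can be sketched numerically. The function name ols_covariance, the simulated regressors, and the value of sigma2 are assumptions for illustration. Stacking the same X twice keeps $(1/n)X^\top X$ exactly fixed, so in this artificial case the variance is halved exactly rather than roughly.

```python
import numpy as np

def ols_covariance(X, sigma2):
    """sigma^2 (X'X)^{-1}, expression (3.28), for a known error variance."""
    return sigma2 * np.linalg.inv(X.T @ X)

rng = np.random.default_rng(42)
n, sigma2 = 50, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])

V_n = ols_covariance(X, sigma2)
V_2n = ols_covariance(np.vstack([X, X]), sigma2)   # same regressors observed twice

# Ratio of standard errors with n observations to those with 2n observations.
se_ratio = np.sqrt(np.diag(V_n) / np.diag(V_2n))
print(se_ratio)   # each entry equals the square root of 2
```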


As an example, suppose that we are estimating a regression model with just a
constant term. We can write the model as $y = \iota\beta_1 + u$, where $\iota$ is an n-vector
of ones. Plugging in $\iota$ for X in (3.04) and (3.28), we find that


$$\hat\beta_1 = (\iota^\top\iota)^{-1}\iota^\top y = \frac{1}{n}\sum_{t=1}^{n} y_t \quad\text{and}\quad \mathrm{Var}(\hat\beta_1) = \sigma_0^2(\iota^\top\iota)^{-1} = \frac{1}{n}\sigma_0^2.$$


Thus, in this particularly simple case, the variance of the least squares estimator
is exactly proportional to $1/n$.


The third thing that affects the precision of $\hat\beta$ is the matrix X. Suppose that
we are interested in a particular coefficient which, without loss of generality,
we may call $\beta_1$. Then, if $\beta_2$ denotes the $(k-1)$-vector of the remaining
coefficients, we can rewrite the regression model (3.03) as


$$y = x_1\beta_1 + X_2\beta_2 + u, \tag{3.30}$$


where X has been partitioned into $x_1$ and $X_2$ to conform with the partition
of $\beta$. By the FWL Theorem, regression (3.30) will yield the same estimate of
$\beta_1$ as the FWL regression


$$M_2 y = M_2 x_1 \beta_1 + \text{residuals},$$


where, as in Section 2.4, $M_2 \equiv I - X_2(X_2^\top X_2)^{-1}X_2^\top$. This estimate is


$$\hat\beta_1 = \frac{x_1^\top M_2 y}{x_1^\top M_2 x_1},$$


and, by a calculation similar to that leading to (3.28), its variance is


$$\sigma_0^2 (x_1^\top M_2 x_1)^{-1}. \tag{3.31}$$


Thus $\mathrm{Var}(\hat\beta_1)$ is equal to the variance of the error terms divided by the squared
length of the vector $M_2 x_1$.


The intuition behind (3.31) is simple. How much information the sample gives
us about $\beta_1$ is proportional to the squared Euclidean length of the vector
$M_2 x_1$, which is the denominator of the right-hand side of (3.31). When
$\|M_2 x_1\|$ is big, either because n is large or because at least some elements of
$M_2 x_1$ are large, $\hat\beta_1$ will be relatively precise. When $\|M_2 x_1\|$ is small, either
because n is small or because all the elements of $M_2 x_1$ are small, $\hat\beta_1$ will be
relatively imprecise.


The squared Euclidean length of the vector $M_2 x_1$ is just the sum of squared
residuals from the regression


$$x_1 = X_2 c + \text{residuals}. \tag{3.32}$$


Thus the variance of $\hat\beta_1$, expression (3.31), is proportional to the inverse of the
sum of squared residuals from regression (3.32). When $x_1$ is well explained
by the other columns of X, this SSR will be small, and the variance of $\hat\beta_1$ will
consequently be large. When $x_1$ is not well explained by the other columns
of X, this SSR will be large, and the variance of $\hat\beta_1$ will consequently be small.
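

The link between (3.28), (3.31), and the SSR from regression (3.32) can be checked with a small sketch. The data-generating choices below (a regressor x1 that is nearly a multiple of another regressor z, and the values of n and sigma2) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 200, 1.0
z = rng.normal(size=n)
x1 = 0.95 * z + 0.1 * rng.normal(size=n)   # x1 is well explained by z
X2 = np.column_stack([np.ones(n), z])
X = np.column_stack([x1, X2])

# Residuals from regressing x1 on X2, that is, the vector M2 x1 of (3.32).
coef, *_ = np.linalg.lstsq(X2, x1, rcond=None)
m2x1 = x1 - X2 @ coef

var_fwl = sigma2 / (m2x1 @ m2x1)                   # expression (3.31)
var_full = sigma2 * np.linalg.inv(X.T @ X)[0, 0]   # first diagonal element of (3.28)

# Variance of the same coefficient when only a constant accompanies x1.
m_iota_x1 = x1 - x1.mean()
var_constant_only = sigma2 / (m_iota_x1 @ m_iota_x1)

print(var_fwl, var_full, var_constant_only)
```

The first two numbers agree, and both are much larger than the third, because z explains most of the variation in x1.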


As the above discussion makes clear, the precision with which $\beta_1$ is estimated
depends on $X_2$ just as much as it depends on $x_1$. Sometimes, if we just
regress y on a constant and $x_1$, we may obtain what seems to be a very
precise estimate of $\beta_1$, but if we then include some additional regressors, the
estimate becomes much less precise. The reason for this is that the additional
regressors do a much better job of explaining $x_1$ in regression (3.32) than does
a constant alone. As a consequence, the length of $M_2 x_1$ is much less than the
length of $M_\iota x_1$. This type of situation is sometimes referred to as collinearity,
or multicollinearity, and the regressor $x_1$ is said to be collinear with some of
the other regressors. This terminology is not very satisfactory, since, if a
regressor were collinear with other regressors in the usual mathematical sense
of the term, the regressors would be linearly dependent. It would be better to
speak of approximate collinearity, although econometricians seldom bother
with this nicety. Collinearity can cause difficulties for applied econometric
work, but these difficulties are essentially the same as the ones caused by
having a sample size that is too small. In either case, the data simply do not
contain enough information to allow us to obtain precise estimates of all the
coefficients.


The covariance matrix of $\hat\beta$, expression (3.28), tells us all that we can possibly
know about the second moments of $\hat\beta$. In practice, of course, we will rarely
know (3.28), but we can estimate it by using an estimate of $\sigma_0^2$. How to
obtain such an estimate will be discussed in Section 3.6. Using this estimated
covariance matrix, we can then, if we are willing to make some more or less
strong assumptions, make exact or approximate inferences about the true
parameter vector $\beta_0$. Just how we can do this will be discussed at length in
Chapters 4 and 5.


Linear Functions of Parameter Estimates 


The covariance matrix of $\hat\beta$ can be used to calculate the variance of any linear
(strictly speaking, affine) function of $\hat\beta$. Suppose that we are interested in
the variance of $\hat\gamma$, where $\gamma = w^\top\beta$, $\hat\gamma = w^\top\hat\beta$, and w is a k-vector of known
coefficients. By choosing w appropriately, we can make $\gamma$ equal to any one
of the $\beta_i$, or to the sum of the $\beta_i$, or to any linear combination of the $\beta_i$ in
which we might be interested. For example, if $\gamma = 3\beta_1 - \beta_4$, w would be a
vector with 3 as the first element, $-1$ as the fourth element, and 0 for all the
other elements.


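It is easy to show that

$$\mathrm{Var}(\hat\gamma) = w^\top \mathrm{Var}(\hat\beta)\, w = \sigma_0^2\, w^\top (X^\top X)^{-1} w. \tag{3.33}$$

This result can be obtained as follows. By (3.22),

$$\begin{aligned}
\mathrm{Var}(w^\top\hat\beta) &= E\bigl(w^\top(\hat\beta - \beta_0)(\hat\beta - \beta_0)^\top w\bigr) \\
&= w^\top E\bigl((\hat\beta - \beta_0)(\hat\beta - \beta_0)^\top\bigr)\, w \\
&= w^\top \bigl(\sigma_0^2 (X^\top X)^{-1}\bigr)\, w,
\end{aligned}$$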


from which (3.33) follows immediately. Notice that, in general, the variance
of $\hat\gamma$ depends on every element of the covariance matrix of $\hat\beta$; this is made
explicit in expression (3.68), which readers are asked to derive in Exercise 3.10.
Of course, if some elements of w are equal to 0, $\mathrm{Var}(\hat\gamma)$ will not depend on
the corresponding rows and columns of $\sigma_0^2 (X^\top X)^{-1}$.


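It may be illuminating to consider the special case used as an example above,
in which $\gamma = 3\beta_1 - \beta_4$. In this case, the result (3.33) implies that


$$\begin{aligned}
\mathrm{Var}(\hat\gamma) &= w_1^2\, \mathrm{Var}(\hat\beta_1) + w_4^2\, \mathrm{Var}(\hat\beta_4) + 2 w_1 w_4\, \mathrm{Cov}(\hat\beta_1, \hat\beta_4) \\
&= 9\, \mathrm{Var}(\hat\beta_1) + \mathrm{Var}(\hat\beta_4) - 6\, \mathrm{Cov}(\hat\beta_1, \hat\beta_4).
\end{aligned}$$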
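

The expansion above can be checked against the quadratic form in (3.33). In this sketch the regressor matrix, the sample size, and sigma2 are hypothetical, and Python's zero-based indices 0 and 3 stand for the first and fourth coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 100, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
V = sigma2 * np.linalg.inv(X.T @ X)      # Var(beta_hat) from (3.28)

w = np.array([3.0, 0.0, 0.0, -1.0])      # gamma = 3*beta_1 - beta_4
var_quadratic_form = w @ V @ w           # expression (3.33)
var_expanded = 9 * V[0, 0] + V[3, 3] - 6 * V[0, 3]

print(var_quadratic_form, var_expanded)  # the two computations agree
```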


Notice that the variance of $\hat\gamma$ depends on the covariance of $\hat\beta_1$ and $\hat\beta_4$ as well
as on their variances. If this covariance is large and positive, $\mathrm{Var}(\hat\gamma)$ may be
small, even if $\mathrm{Var}(\hat\beta_1)$ and $\mathrm{Var}(\hat\beta_4)$ are both large.


The Variance of Forecast Errors


The variance of the error associated with a regression-based forecast can be
obtained by using the result (3.33). Suppose we have computed a vector of
OLS estimates $\hat\beta$ and wish to use them to forecast $y_s$, for s not in $1, \ldots, n$,
using an observed vector of regressors $X_s$. Then the forecast of $y_s$ will simply
be $X_s\hat\beta$. For simplicity, let us assume that $\hat\beta$ is unbiased, which implies that
the forecast itself is unbiased. Therefore, the forecast error has mean zero,
and its variance is


$$\begin{aligned}
E(y_s - X_s\hat\beta)^2 &= E(X_s\beta_0 + u_s - X_s\hat\beta)^2 \\
&= E(u_s^2) + E(X_s\beta_0 - X_s\hat\beta)^2 \\
&= \sigma_0^2 + \mathrm{Var}(X_s\hat\beta).
\end{aligned} \tag{3.34}$$


The first equality here depends on the assumption that the regression model
is correctly specified, the second depends on the assumption that the error
terms are serially uncorrelated, which ensures that $E(u_s X_s\hat\beta) = 0$, and the
third uses the fact that $\hat\beta$ is assumed to be unbiased.


Using the result (3.33), and recalling that $X_s$ is a row vector, we see that the
last line of (3.34) is equal to


$$\sigma_0^2 + X_s \mathrm{Var}(\hat\beta) X_s^\top = \sigma_0^2 + \sigma_0^2 X_s (X^\top X)^{-1} X_s^\top. \tag{3.35}$$


Thus we find that the variance of the forecast error is the sum of two terms.
The first term is simply the variance of the error term $u_s$. If we knew the true
value of $\beta$, this would be the variance of the forecast error. The second term,
which makes the variance of the forecast error larger than $\sigma_0^2$, arises because
we are using the estimate $\hat\beta$ instead of the true parameter vector $\beta_0$. It can be
thought of as the penalty we pay for our ignorance of $\beta$. Of course, the result
(3.35) can easily be generalized to the case in which we are forecasting a vector
of values of the dependent variable; see Exercise 3.16.
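

A minimal sketch of (3.35) follows. The function name forecast_error_variance and the numerical values are hypothetical, and the error variance is treated as known. For a model with only a constant term, the formula reduces to $\sigma_0^2(1 + 1/n)$.

```python
import numpy as np

def forecast_error_variance(X, x_s, sigma2):
    """sigma^2 + sigma^2 x_s (X'X)^{-1} x_s', as in (3.35); x_s is a 1-D array."""
    return sigma2 + sigma2 * x_s @ np.linalg.solve(X.T @ X, x_s)

n, sigma2 = 25, 4.0
iota = np.ones((n, 1))                       # regression on a constant only
v = forecast_error_variance(iota, np.array([1.0]), sigma2)
print(v, sigma2 * (1 + 1 / n))               # both equal sigma^2 (1 + 1/n)
```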


3.5 Efficiency of the OLS Estimator 


One of the reasons for the popularity of ordinary least squares is that, under 
certain conditions, the OLS estimator can be shown to be more efficient than 
many competing estimators. One estimator is said to be more efficient than 
another if, on average, the former yields more accurate estimates than the 
latter. The reason for the terminology is that an estimator which yields more 
accurate estimates can be thought of as utilizing the information available in 
the sample more efficiently. 


For a scalar parameter, the accuracy of an estimator is often taken to be
proportional to the inverse of its variance, and this is sometimes called the
precision of the estimator. For an estimate of a parameter vector, the precision
matrix is defined as the inverse of the covariance matrix of the estimator. For
scalar parameters, one estimator of the parameter is said to be more efficient
than another if the precision of the former is larger than that of the latter.
For parameter vectors, there is a natural way to generalize this idea. Suppose
that $\hat\beta$ and $\tilde\beta$ are two unbiased estimators of a k-vector of parameters $\beta$, with
covariance matrices $\mathrm{Var}(\hat\beta)$ and $\mathrm{Var}(\tilde\beta)$, respectively. Then, if efficiency is
measured in terms of precision, $\hat\beta$ is said to be more efficient than $\tilde\beta$ if and
only if the difference between their precision matrices, $\mathrm{Var}(\hat\beta)^{-1} - \mathrm{Var}(\tilde\beta)^{-1}$,
is a nonzero positive semidefinite matrix.


Since it is more usual to work in terms of variance than precision, it is
convenient to express the efficiency condition directly in terms of covariance matrices.
As readers are asked to show in Exercise 3.8, if A and B are positive definite
matrices of the same dimensions, then the matrix $A - B$ is positive semidefinite
if and only if $B^{-1} - A^{-1}$ is positive semidefinite. Thus the efficiency
condition expressed above in terms of precision matrices is equivalent to saying
that $\hat\beta$ is more efficient than $\tilde\beta$ if and only if $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is a nonzero
positive semidefinite matrix.


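If $\hat\beta$ is more efficient than $\tilde\beta$ in this sense, then every individual parameter in
the vector $\beta$, and every linear combination of those parameters, is estimated
at least as efficiently by using $\hat\beta$ as by using $\tilde\beta$. Consider an arbitrary linear
combination of the parameters in $\beta$, say $\gamma = w^\top\beta$, for any k-vector w that
we choose. As we saw in the preceding section, $\mathrm{Var}(\hat\gamma) = w^\top \mathrm{Var}(\hat\beta)\, w$, and
similarly for $\mathrm{Var}(\tilde\gamma)$. Therefore, the difference between $\mathrm{Var}(\tilde\gamma)$ and $\mathrm{Var}(\hat\gamma)$ is


$$w^\top \mathrm{Var}(\tilde\beta)\, w - w^\top \mathrm{Var}(\hat\beta)\, w = w^\top \bigl(\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)\bigr) w. \tag{3.36}$$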


The right-hand side of (3.36) must be either positive or zero whenever the
matrix $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is positive semidefinite. Thus, if $\hat\beta$ is a more efficient
estimator than $\tilde\beta$, we can be sure that the variance of $\hat\gamma$ will be no greater
than that of $\tilde\gamma$. In practice, when one estimator is more efficient than another, the
difference between the covariance matrices is very often positive definite. When
that is the case, every parameter or linear combination of parameters will be
estimated more efficiently using $\hat\beta$ than using $\tilde\beta$.


We now let $\hat\beta$, as usual, denote the vector of OLS parameter estimates (3.04).
As we are about to show, this estimator is more efficient than any other
linear unbiased estimator. In Section 3.3, we discussed what it means for an
estimator to be unbiased, but we have not yet discussed what it means for
an estimator to be linear. It simply means that we can write the estimator
as a linear (affine) function of y, the vector of observations on the dependent
variable. It is clear that $\hat\beta$ itself is a linear estimator, because it is equal to
the matrix $(X^\top X)^{-1}X^\top$ times the vector y.


If $\tilde\beta$ now denotes any linear estimator that is not the OLS estimator, we can
always write

$$\tilde\beta = Ay = (X^\top X)^{-1}X^\top y + Cy, \tag{3.37}$$

where A and C are $k \times n$ matrices that depend on X. The first equality here
just says that $\tilde\beta$ is a linear estimator. To obtain the second equality, we make
the definition

$$C \equiv A - (X^\top X)^{-1}X^\top. \tag{3.38}$$


So far, least squares is the only estimator for linear regression models that
we have encountered. Thus it may be difficult to imagine what kind of estimator
$\tilde\beta$ might be. In fact, there are many estimators of this type, including
generalized least squares estimators (Chapter 7) and instrumental variables
estimators (Chapter 8). An alternative way of writing the class of linear
unbiased estimators is explored in Exercise 3.17.


The principal theoretical result on the efficiency of the OLS estimator is called
the Gauss-Markov Theorem. An informal way of stating this theorem is to
say that $\hat\beta$ is the best linear unbiased estimator, or BLUE for short. In other
words, the OLS estimator is more efficient than any other linear unbiased
estimator.

Theorem 3.1. (Gauss-Markov Theorem)
If it is assumed that $E(u \mid X) = 0$ and $E(uu^\top \mid X) = \sigma^2 I$ in the
linear regression model (3.03), then the OLS estimator $\hat\beta$ is more
efficient than any other linear unbiased estimator $\tilde\beta$, in the sense
that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is a positive semidefinite matrix.


Proof: We assume that the DGP is a special case of (3.03), with parameters
$\beta_0$ and $\sigma_0^2$. Substituting for y in (3.37), we find that


$$\tilde\beta = A(X\beta_0 + u) = AX\beta_0 + Au. \tag{3.39}$$


Since we want $\tilde\beta$ to be unbiased, we require that the expectation of the
rightmost expression in (3.39), conditional on X, should be $\beta_0$. The second term in
that expression has conditional mean 0, and so the first term must have
conditional mean $\beta_0$. This will be the case for all $\beta_0$ if and only if $AX = I$, the
$k \times k$ identity matrix. From (3.38), this condition is equivalent to $CX = O$.
Thus requiring $\tilde\beta$ to be unbiased imposes a strong condition on the matrix C.


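The unbiasedness condition that $CX = O$ implies that $Cy = Cu$. Since,
from (3.37), $Cy = \tilde\beta - \hat\beta$, this makes it clear that $\tilde\beta - \hat\beta$ has conditional mean
zero. The unbiasedness condition also implies that the covariance matrix of
$\tilde\beta - \hat\beta$ and $\hat\beta$ is a zero matrix. To see this, observe that


$$\begin{aligned}
E\bigl((\hat\beta - \beta_0)(\tilde\beta - \hat\beta)^\top\bigr) &= E\bigl((X^\top X)^{-1}X^\top uu^\top C^\top\bigr) \\
&= (X^\top X)^{-1}X^\top \sigma_0^2 I\, C^\top \\
&= \sigma_0^2 (X^\top X)^{-1}X^\top C^\top = O.
\end{aligned} \tag{3.40}$$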


Consequently, equation (3.37) says that the unbiased linear estimator $\tilde\beta$ is
equal to the least squares estimator $\hat\beta$ plus a random component $Cy$ which
has mean zero and is uncorrelated with $\hat\beta$. The random component simply
adds noise to the efficient estimator $\hat\beta$. This makes it clear that $\hat\beta$ is more
efficient than $\tilde\beta$. To complete the proof, we note that


$$\begin{aligned}
\mathrm{Var}(\tilde\beta) &= \mathrm{Var}\bigl(\hat\beta + (\tilde\beta - \hat\beta)\bigr) \\
&= \mathrm{Var}(\hat\beta + Cy) \\
&= \mathrm{Var}(\hat\beta) + \mathrm{Var}(Cy),
\end{aligned} \tag{3.41}$$


because, from (3.40), the covariance of $\hat\beta$ and $Cy$ is zero. Thus the difference
between $\mathrm{Var}(\tilde\beta)$ and $\mathrm{Var}(\hat\beta)$ is $\mathrm{Var}(Cy)$. Since it is a covariance matrix, this
difference is necessarily positive semidefinite. ∎


We will encounter many cases in which an inefficient estimator is equal to 
an efficient estimator plus a random variable that has mean zero and is un- 
correlated with the efficient estimator. The zero correlation ensures that the 
covariance matrix of the inefficient estimator is equal to the covariance matrix 
of the efficient estimator plus another matrix that is positive semidefinite, as 
in the last line of (3.41). If the correlation were not zero, this sort of proof 
would not work. Observe that, because everything is done in terms of second 
moments, the Gauss-Markov Theorem does not require any assumption about 
the normality of the error terms. 
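

The steps of the proof can be traced numerically. In the sketch below, the competing estimator is a weighted least squares estimator with arbitrary fixed positive weights; the weight matrix W, the simulated X, and the normalization $\sigma_0^2 = 1$ are all assumptions made for illustration. Because $\mathrm{Var}(u) = \sigma_0^2 I$, the covariance matrix of any linear estimator Ay is $\sigma_0^2 AA^\top$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
W = np.diag(rng.uniform(0.5, 2.0, size=n))       # arbitrary positive weights (hypothetical)

A_ols = np.linalg.solve(X.T @ X, X.T)            # (X'X)^{-1} X'
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)    # another matrix A with A X = I
C = A_alt - A_ols                                # definition (3.38)

# With Var(u) = sigma^2 I and sigma^2 = 1, Var(A y) = A A'.
var_ols = A_ols @ A_ols.T
var_alt = A_alt @ A_alt.T
difference = var_alt - var_ols

print(np.allclose(C @ X, 0.0, atol=1e-10))       # unbiasedness condition C X = O
print(np.allclose(difference, C @ C.T))          # the difference is Var(C y) = C C'
print(np.linalg.eigvalsh(difference).min())      # nonnegative up to rounding error
```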


The Gauss-Markov Theorem that the OLS estimator is BLUE is one of the 
most famous results in statistics. However, it is important to keep in mind 
the limitations of this theorem. The theorem applies only to a correctly speci- 
fied model with error terms that are homoskedastic and serially uncorrelated. 
Moreover, it does not say that the OLS estimator $\hat\beta$ is more efficient than
every imaginable estimator. Estimators which are nonlinear and/or biased 
may well perform better than ordinary least squares. 


3.6 Residuals and Error Terms 


The vector of least squares residuals, $\hat u \equiv y - X\hat\beta$, is easily calculated once we
have obtained $\hat\beta$. The numerical properties of $\hat u$ were discussed in Section 2.3.
These properties include the fact that $\hat u$ is orthogonal to $X\hat\beta$ and to every
vector that lies in $\mathcal{S}(X)$. In this section, we turn our attention to the statistical
properties of $\hat u$ as an estimator of u. These properties are very important,
because we will want to use $\hat u$ for a number of purposes. In particular, we
will want to use it to estimate $\sigma^2$, the variance of the error terms. We need
an estimate of $\sigma^2$ if we are to obtain an estimate of the covariance matrix
of $\hat\beta$. As we will see in later chapters, the residuals can also be used to test
some of the strong assumptions that are often made about the distribution
of the error terms and to implement more sophisticated estimation methods
that require weaker assumptions.


The consistency of $\hat\beta$ implies that $\hat u \to u$ as $n \to \infty$, but the finite-sample
properties of $\hat u$ differ from those of u. As we saw in Section 2.3, the vector of
residuals $\hat u$ is what remains after we project the regressand y off $\mathcal{S}(X)$. If we
assume that the DGP belongs to the model we are estimating, as the DGP
(3.02) belongs to the model (3.01), then


$$M_X y = M_X X\beta_0 + M_X u = M_X u.$$


The first term in the middle expression here vanishes because $M_X$ annihilates
everything that lies in $\mathcal{S}(X)$. The statistical properties of $\hat u$ as an estimator
of u follow directly from the fact that $\hat u = M_X u$ when the model (3.01) is
correctly specified.


Each of the residuals is equal to a linear combination of every one of the
error terms. Consider a single row of the matrix product $\hat u = M_X u$. Since
the product has dimensions $n \times 1$, this row has just one element, and this
element is one of the residuals. Recalling the result on partitioned matrices in
Exercise 1.14, which allows us to select rows of a matrix product by selecting
that row of the leftmost factor, we can write the $t$th residual as


$$\begin{aligned}
\hat u_t &= u_t - X_t (X^\top X)^{-1} X^\top u \\
&= u_t - \sum_{s=1}^{n} X_t (X^\top X)^{-1} X_s^\top u_s.
\end{aligned} \tag{3.42}$$


Thus, even if each of the error terms $u_t$ is independent of all the other error
terms, as we have been assuming, each of the $\hat u_t$ will not be independent of
all the other residuals. In general, there will be some dependence between
every pair of residuals. However, this dependence will generally diminish as
the sample size n increases.


Let us now assume that $E(u \mid X) = 0$. This is assumption (3.08), which we
made in Section 3.2 in order to prove that $\hat\beta$ is unbiased. According to this
assumption, $E(u_t \mid X) = 0$ for all t. All the expectations we will take in the
remainder of this section will be conditional on X. Since, by (3.42), $\hat u_t$ is
just a linear combination of all the $u_t$, the expectation of $\hat u_t$ conditional on
X must be zero. Thus, in this respect, the residuals $\hat u_t$ behave just like the
error terms $u_t$.


In other respects, however, the residuals do not have the same properties as
the error terms. Consider $\mathrm{Var}(\hat u_t)$, the variance of $\hat u_t$. Since $E(\hat u_t) = 0$, this
variance is just $E(\hat u_t^2)$. As we saw in Section 2.3, the Euclidean length of the
vector of least squares residuals, $\hat u$, is always smaller than that of the vector of
residuals evaluated at any other value, $u(\beta)$. In particular, $\hat u$ must be shorter
than the vector of error terms $u = u(\beta_0)$. Thus we know that $\|\hat u\|^2 \le \|u\|^2$.
This implies that $E(\|\hat u\|^2) \le E(\|u\|^2)$. If, as usual, we assume that the error
variance is $\sigma_0^2$ under the true DGP, we see that

$$\begin{aligned}
\sum_{t=1}^{n} \mathrm{Var}(\hat u_t) = \sum_{t=1}^{n} E(\hat u_t^2) &= E\Bigl(\sum_{t=1}^{n} \hat u_t^2\Bigr) = E(\|\hat u\|^2) \\
&\le E(\|u\|^2) = E\Bigl(\sum_{t=1}^{n} u_t^2\Bigr) = \sum_{t=1}^{n} E(u_t^2) = n\sigma_0^2.
\end{aligned}$$


This suggests that, at least for most observations, the variance of $\hat u_t$ must
be less than $\sigma_0^2$. In fact, we will see that $\mathrm{Var}(\hat u_t)$ is less than $\sigma_0^2$ for every
observation.


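The easiest way to calculate the variance of $\hat u_t$ is to calculate the covariance
matrix of the entire vector $\hat u$:


$$\begin{aligned}
\mathrm{Var}(\hat u) &= \mathrm{Var}(M_X u) = E(M_X uu^\top M_X) \\
&= M_X E(uu^\top) M_X = M_X \mathrm{Var}(u) M_X \\
&= M_X (\sigma_0^2 I) M_X = \sigma_0^2 M_X M_X = \sigma_0^2 M_X.
\end{aligned} \tag{3.43}$$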


The second equality in the first line here uses the fact that $M_X u$ has mean 0.
The third equality in the last line uses the fact that $M_X$ is idempotent. From
the result (3.43), we see immediately that, in general, $E(\hat u_t \hat u_s) \ne 0$ for $t \ne s$.
Thus, even though the original error terms are assumed to be uncorrelated,
the residuals will not be uncorrelated.


From (3.43), it can also be seen that the residuals will not have constant
variance, and that this variance will always be smaller than $\sigma_0^2$. Recall from
Section 2.6 that $h_t$ denotes the $t$th diagonal element of the projection matrix
$P_X$. Thus a typical diagonal element of $M_X$ is $1 - h_t$. Therefore, it follows
from (3.43) that

$$\mathrm{Var}(\hat u_t) = E(\hat u_t^2) = (1 - h_t)\sigma_0^2. \tag{3.44}$$


Since $0 \le 1 - h_t < 1$, (3.44) implies that $E(\hat u_t^2)$ will always be smaller than
$\sigma_0^2$. Just how much smaller will depend on $h_t$. It is clear that high-leverage
observations, for which $h_t$ is relatively large, will have residuals with smaller
variance than low-leverage observations, for which $h_t$ is relatively small. This
makes sense, since high-leverage observations have more effect on the parameter
values. As a consequence, the residuals for high-leverage observations
tend to be shrunk more, relative to the error terms, than the residuals for
low-leverage observations.
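

The relation between leverage and residual variance in (3.43) and (3.44) can be sketched as follows. The simulated regressors and the artificial high-leverage point in the first row are hypothetical, and sigma2 is treated as known. Forming the full n x n projection matrix is done here only for clarity; it would be wasteful for large n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X[0, 1] = 8.0   # make the first observation a high-leverage point

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix P_X
M = np.eye(n) - P                       # M_X = I - P_X
h = np.diag(P)                          # leverages h_t

var_resid = (1.0 - h) * sigma2          # residual variances, expression (3.44)
cov_resid = sigma2 * M                  # full covariance matrix, expression (3.43)

print(h.sum())                          # the h_t sum to k
print(h[0], var_resid[0])               # largest leverage, smallest residual variance
```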


Estimating the Variance of the Error Terms 


The method of least squares provides estimates of the regression coefficients,
but it does not directly provide an estimate of $\sigma^2$, the variance of the error
terms. The method of moments suggests that we can estimate $\sigma^2$ by using the
corresponding sample moment. If we actually observed the $u_t$, this sample
moment would be


$$\frac{1}{n}\sum_{t=1}^{n} u_t^2. \tag{3.45}$$


We do not observe the $u_t$, but we do observe the $\hat u_t$. Thus the simplest possible
MM estimator is

$$\hat\sigma^2 \equiv \frac{1}{n}\sum_{t=1}^{n} \hat u_t^2. \tag{3.46}$$


This estimator is just the average of n squared residuals. It can be shown to
be consistent; see Exercise 3.13. However, because each squared residual has
expectation less than $\sigma_0^2$, by (3.44), $\hat\sigma^2$ must be biased downward.


It is easy to calculate the bias of $\hat\sigma^2$. We saw in Section 2.6 that $\sum_{t=1}^{n} h_t = k$.
Therefore, from (3.44) and (3.46),


$$E(\hat\sigma^2) = \frac{1}{n}\sum_{t=1}^{n} E(\hat u_t^2) = \frac{1}{n}\sum_{t=1}^{n} (1 - h_t)\sigma_0^2 = \frac{n-k}{n}\,\sigma_0^2. \tag{3.47}$$


Since $\hat u = M_X u$ and $M_X$ is idempotent, the sum of squared residuals is just
$u^\top M_X u$. The result (3.47) implies that


$$E(u^\top M_X u) = E\bigl(\mathrm{SSR}(\hat\beta)\bigr) = E\Bigl(\sum_{t=1}^{n} \hat u_t^2\Bigr) = (n-k)\sigma_0^2. \tag{3.48}$$


Readers are asked to show this in a different way in Exercise 3.14. Notice, 
from (3.48), that adding one more regressor has exactly the same effect on 
the expectation of the SSR as taking away one observation. 


The result (3.47) suggests another MM estimator which will be unbiased:

$$s^2 \equiv \frac{1}{n-k}\sum_{t=1}^{n} \hat u_t^2. \tag{3.49}$$

The only difference between $\hat\sigma^2$ and $s^2$ is that the former divides the SSR by $n$
and the latter divides it by $n-k$. As a result, $s^2$ will be unbiased whenever $\hat\beta$
is. Ideally, if we were able to observe the error terms, our MM estimator would
be (3.45), which would be unbiased. When we replace the error terms $u_t$ by
the residuals $\hat u_t$, we introduce a downward bias. Dividing by $n-k$ instead of
by $n$ eliminates this bias.
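
The two estimators are easy to compute side by side. The sketch below assumes NumPy; the simulated data, the seed, and the function names `variance_estimators` and `expected_ssr` are illustrative assumptions, not part of the text. It computes $\hat\sigma^2$ and $s^2$ from the same residuals and evaluates $E(\mathrm{SSR}) = \sigma_0^2\,\mathrm{Tr}(M_X)$ directly.

```python
import numpy as np

def variance_estimators(y, X):
    """Return (sigma_hat_sq, s_sq): SSR/n as in (3.46) and SSR/(n-k) as in (3.49)."""
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    ssr = float(resid @ resid)
    return ssr / n, ssr / (n - k)

def expected_ssr(X, sigma0_sq):
    """E(SSR) = sigma0^2 * trace(M_X), which equals (n - k) * sigma0^2 as in (3.48)."""
    n = X.shape[0]
    P = X @ np.linalg.solve(X.T @ X, X.T)
    return sigma0_sq * np.trace(np.eye(n) - P)

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)
sigma_hat_sq, s_sq = variance_estimators(y, X)
```

For any sample, $s^2 = \hat\sigma^2\, n/(n-k)$, so the two estimates differ noticeably only when $k$ is not small relative to $n$.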


Virtually all OLS regression programs report $s^2$ as the estimated variance of
the error terms. However, it is important to remember that, even though $s^2$
provides an unbiased estimate of $\sigma_0^2$, $s$ itself does not provide an unbiased
estimate of $\sigma_0$, because taking the square root of $s^2$ is a nonlinear operation. If
we replace $\sigma_0^2$ by $s^2$ in expression (3.28), we can obtain an unbiased estimate
of $\mathrm{Var}(\hat\beta)$,

$$\widehat{\mathrm{Var}}(\hat\beta) = s^2 (X^\top X)^{-1}. \tag{3.50}$$


This is the standard estimate of the covariance matrix of the OLS parameter 
estimates under the assumption of IID errors. 


3.7 Misspecification of Linear Regression Models 


Up to this point, we have assumed that the DGP belongs to the model that 
is being estimated, or, in other words, that the model is correctly specified. 
This is obviously a very strong assumption indeed. It is therefore important 
to know something about the statistical properties of $\hat\beta$ when the model is not
correctly specified. In this section, we consider a simple case of misspecifica- 
tion, namely, underspecification. In order to understand underspecification 
better, we begin by discussing its opposite, overspecification. 


Overspecification 


A model is said to be overspecified if some variables that rightly belong to the 
information set $\Omega_t$, but do not appear in the DGP, are mistakenly included
in the model. Overspecification is not a form of misspecification. Including 
irrelevant explanatory variables in a model makes the model larger than it 
need have been, but, since the DGP remains a special case of the model, there 
is no misspecification. Consider the case of an overspecified linear regression 
model. Suppose that we estimate the model 


$$y = X\beta + Z\gamma + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I), \tag{3.51}$$

when the data are actually generated by

$$y = X\beta_0 + u, \quad u \sim \mathrm{IID}(0, \sigma_0^2 I). \tag{3.52}$$

It is assumed that $X_t$ and $Z_t$, the $t^{\rm th}$ rows of $X$ and $Z$, respectively, belong to
$\Omega_t$. Recall the discussion of information sets in Section 1.3. The overspecified
model (3.51) is not misspecified, since the DGP (3.52) is a special case of it,
with $\beta = \beta_0$, $\gamma = 0$, and $\sigma^2 = \sigma_0^2$.


Suppose now that we run the linear regression (3.51). By the FWL Theorem,
the estimates $\tilde\beta$ from (3.51) are the same as those from the regression

$$M_Z y = M_Z X\beta + \text{residuals},$$

where, as usual, $M_Z = I - Z(Z^\top Z)^{-1}Z^\top$. Thus we see that

$$\tilde\beta = (X^\top M_Z X)^{-1} X^\top M_Z y. \tag{3.53}$$

Since $\tilde\beta$ is part of the OLS estimator of a correctly specified model, it should
be unbiased. Indeed, if we replace $y$ by $X\beta_0 + u$, we find from (3.53) that

$$\tilde\beta = \beta_0 + (X^\top M_Z X)^{-1} X^\top M_Z u. \tag{3.54}$$


The conditional expectation of the second term on the right-hand side of
(3.54) is 0, provided we take expectations conditional on $Z$ as well as on $X$;
see Section 3.2. Since $Z_t$ is assumed to belong to $\Omega_t$, it is perfectly legitimate
to do this.
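
The FWL result used to obtain (3.53) can be verified on simulated data. This is only a sketch, assuming NumPy; the data-generating values, the seed, and the helper `annihilator` are hypothetical. It compares the coefficients on $X$ from the unrestricted regression with those from the regression of $M_Z y$ on $M_Z X$.

```python
import numpy as np

def annihilator(Z):
    """M_Z = I - Z (Z'Z)^{-1} Z'."""
    n = Z.shape[0]
    return np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(2)              # arbitrary seed
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))                 # irrelevant regressors
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)   # DGP of the form (3.52)

# Unrestricted regression of y on [X Z]; keep only the coefficients on X.
full, *_ = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)
beta_tilde_full = full[: X.shape[1]]

# FWL regression of M_Z y on M_Z X, as in (3.53).
MZ = annihilator(Z)
beta_tilde_fwl = np.linalg.solve(X.T @ MZ @ X, X.T @ MZ @ y)
```

The two vectors agree up to rounding error, whatever the DGP, because the FWL Theorem is a purely numerical result.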


If we had estimated (3.51) subject to the valid restriction that $\gamma = 0$, we
would have obtained the OLS estimate $\hat\beta$, expression (3.04), which is unbiased
and has covariance matrix (3.28). We see that both $\tilde\beta$ and $\hat\beta$ are unbiased
estimators, linear in $y$. Both are OLS estimators, and so it seems that we
should be able to apply the Gauss-Markov Theorem to both of them. This is 
in fact correct, but we must be careful to apply the theorem in the context of 
the appropriate model for each of the estimators. 


For $\hat\beta$, the appropriate model is the restricted model,

$$y = X\beta + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I), \tag{3.55}$$

in which the restriction $\gamma = 0$ is explicitly imposed. Provided this restriction
is correct, as it will be if the true DGP takes the form (3.52), $\hat\beta$ must be more
efficient than any other linear unbiased estimator of $\beta$. Thus we should find
that the matrix $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is positive semidefinite.

For $\tilde\beta$, the appropriate model is the unrestricted model (3.51). In this context,
the Gauss-Markov Theorem says that, when we do not know the true value
of $\gamma$, $\tilde\beta$ is the best linear unbiased estimator of $\beta$. It is important to note here
that $\hat\beta$ is not an unbiased estimator of $\beta$ for the unrestricted model, and so
it cannot be included in the class of estimators covered by the Gauss-Markov
Theorem for that model. We will make this point more fully in the next
subsection, when we discuss underspecification.


It is illuminating to check these consequences of the Gauss-Markov Theorem 
explicitly. From equation (3.54), it follows that 


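$$\begin{aligned}
\mathrm{Var}(\tilde\beta) &= E\bigl((\tilde\beta - \beta_0)(\tilde\beta - \beta_0)^\top\bigr) \\
&= (X^\top M_Z X)^{-1} X^\top M_Z\, E(uu^\top)\, M_Z X (X^\top M_Z X)^{-1} \\
&= \sigma_0^2 (X^\top M_Z X)^{-1} X^\top M_Z I M_Z X (X^\top M_Z X)^{-1} \\
&= \sigma_0^2 (X^\top M_Z X)^{-1}.
\end{aligned} \tag{3.56}$$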


The situation is clear in the case in which there is only one parameter, $\beta$,
corresponding to a single regressor, $x$. Since $M_Z$ is a projection matrix, the
Euclidean length of $M_Z x$ must be smaller (or at least, no larger) than the
Euclidean length of $x$; recall (2.28). Thus $x^\top M_Z x \le x^\top x$, which implies that

$$\sigma_0^2 (x^\top M_Z x)^{-1} \ge \sigma_0^2 (x^\top x)^{-1}. \tag{3.57}$$

The inequality in (3.57) will almost always hold strictly. The only exception
is the special case in which $x$ lies in $\mathcal{S}^\perp(Z)$, which implies that the regression
of $x$ on $Z$ has no explanatory power at all.


In general, we wish to show that $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is a positive semidefinite
matrix. As we saw in Section 3.5, this is equivalent to showing that the matrix
$\mathrm{Var}(\hat\beta)^{-1} - \mathrm{Var}(\tilde\beta)^{-1}$ is positive semidefinite. A little algebra shows that

$$\begin{aligned}
X^\top X - X^\top M_Z X &= X^\top (I - M_Z) X \\
&= X^\top P_Z X \\
&= (P_Z X)^\top P_Z X.
\end{aligned} \tag{3.58}$$

Since $X^\top X - X^\top M_Z X$ can be written as the transpose of a matrix times itself,
it must be positive semidefinite. Dividing by $\sigma_0^2$ gives the desired result.
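
Both the identity (3.58) and the resulting ordering of the covariance matrices can be checked numerically. The following sketch assumes NumPy, a hypothetical simulated design, and a known error variance; it is meant as an illustration of the algebra rather than as a proof.

```python
import numpy as np

rng = np.random.default_rng(3)              # arbitrary seed
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))
sigma0_sq = 1.0                             # assumed known, for illustration

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
MZ = np.eye(n) - PZ

var_hat = sigma0_sq * np.linalg.inv(X.T @ X)          # restricted model, (3.28)
var_tilde = sigma0_sq * np.linalg.inv(X.T @ MZ @ X)   # unrestricted model, (3.56)

# (3.58): X'X - X'M_Z X = (P_Z X)'(P_Z X)
lhs = X.T @ X - X.T @ MZ @ X
rhs = (PZ @ X).T @ (PZ @ X)

# Smallest eigenvalue of each symmetric difference (nonnegative up to rounding).
min_eig_precision = np.linalg.eigvalsh(lhs).min()
min_eig_variance = np.linalg.eigvalsh(var_tilde - var_hat).min()
```

A nonnegative smallest eigenvalue in each case is what positive semidefiniteness requires.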


We have established that the OLS estimator of $\beta$ in the overspecified regression
model (3.51) is at most as efficient as the OLS estimator in the restricted
model (3.55), provided the restrictions are true. Therefore, adding
variables that do not really belong in a model normally leads to less accurate
estimates. Only in certain very special cases will there be no loss of efficiency.
In such cases, the covariance matrices of $\tilde\beta$ and $\hat\beta$ must be the same, which
implies that the matrix difference computed in (3.58) must be zero.


The last expression in (3.58) will be a zero matrix whenever $P_Z X = O$. This
condition will hold when the two sets of regressors $X$ and $Z$ are mutually
orthogonal, so that $Z^\top X = O$. In this special case, $\tilde\beta$ will be as efficient as $\hat\beta$.
In general, however, including regressors that do not belong in a model will
increase the variance of the estimates of the coefficients on the regressors that
do belong, and the increase can be very great in many cases. As can be seen
from the left-hand side of (3.57), the variance of the estimated coefficient $\tilde\beta$
associated with any regressor $x$ is proportional to the inverse of the SSR from
a regression of $x$ on all the other regressors. The more other regressors there
are, whether they truly belong in the model or not, the smaller will be this
SSR, and, in consequence, the larger will be the variance of $\tilde\beta$.


Underspecification 


The opposite of overspecification is underspecification, in which we omit some
variables that actually do appear in the DGP. To avoid any new notation, let
us suppose that the model we estimate is (3.55), which yields the estimator $\hat\beta$,
but that the DGP is really

$$y = X\beta_0 + Z\gamma_0 + u, \quad u \sim \mathrm{IID}(0, \sigma_0^2 I). \tag{3.59}$$

Thus the situation is precisely the opposite of the one considered above. The
estimator $\tilde\beta$, based on regression (3.51), is now the “correct” one to use, while
the estimator $\hat\beta$ is based on an underspecified model. It is clear that underspecification,
unlike overspecification, is a form of misspecification, because
the DGP (3.59) does not belong to the model (3.55).


The first point to recognize about $\hat\beta$ is that it is now, in general, biased. Substituting
the right-hand side of (3.59) for $y$ in (3.04), and taking expectations
conditional on $X$ and $Z$, we find that

$$\begin{aligned}
E(\hat\beta) &= E\bigl((X^\top X)^{-1} X^\top (X\beta_0 + Z\gamma_0 + u)\bigr) \\
&= \beta_0 + (X^\top X)^{-1} X^\top Z\gamma_0 + E\bigl((X^\top X)^{-1} X^\top u\bigr) \\
&= \beta_0 + (X^\top X)^{-1} X^\top Z\gamma_0.
\end{aligned} \tag{3.60}$$

The second term in the last line of (3.60) will be equal to zero only when
$X^\top Z = O$ or $\gamma_0 = 0$. The first possibility arises when the two sets of
regressors are mutually orthogonal, the second when (3.55) is not in fact
underspecified. Except in these very special cases, $\hat\beta$ will generally be biased.
The magnitude of the bias will depend on the parameter vector $\gamma_0$ and on the
$X$ and $Z$ matrices. Because this bias does not vanish as $n \to \infty$, $\hat\beta$ will also
generally be inconsistent.
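
The bias term in (3.60) can be computed directly for any given $X$, $Z$, and $\gamma_0$. The sketch below assumes NumPy; the simulated regressors, the parameter values, and the function name `omitted_variable_bias` are hypothetical. Setting the error term to zero makes $y$ equal to its conditional mean, so that the restricted OLS estimate reproduces $E(\hat\beta)$ exactly.

```python
import numpy as np

def omitted_variable_bias(X, Z, gamma0):
    """Bias of the restricted estimator: (X'X)^{-1} X'Z gamma0, from (3.60)."""
    return np.linalg.solve(X.T @ X, X.T @ Z @ gamma0)

rng = np.random.default_rng(4)              # arbitrary seed
n = 200
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)            # omitted regressor, correlated with x
X = np.column_stack([np.ones(n), x])
Z = z.reshape(-1, 1)
beta0 = np.array([1.0, 2.0])
gamma0 = np.array([1.5])

bias = omitted_variable_bias(X, Z, gamma0)

# With u = 0, y equals its conditional mean, so OLS on X alone gives E(beta_hat).
y_mean = X @ beta0 + Z @ gamma0
beta_hat_mean, *_ = np.linalg.lstsq(X, y_mean, rcond=None)

# If Z is first made orthogonal to X, the bias vanishes.
Z_orth = Z - X @ np.linalg.solve(X.T @ X, X.T @ Z)
bias_orth = omitted_variable_bias(X, Z_orth, gamma0)
```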


Since $\hat\beta$ is biased, we cannot reasonably use its covariance matrix to evaluate
its accuracy. Instead, we can use the mean squared error matrix, or MSE
matrix, of $\hat\beta$. This matrix is defined as

$$\mathrm{MSE}(\hat\beta) \equiv E\bigl((\hat\beta - \beta_0)(\hat\beta - \beta_0)^\top\bigr). \tag{3.61}$$

The MSE matrix is equal to $\mathrm{Var}(\hat\beta)$ if $\hat\beta$ is unbiased, but not otherwise. For
a scalar parameter $\hat\beta$, the MSE is equal to the square of the bias plus the
variance:

$$\mathrm{MSE}(\hat\beta) = \bigl(E(\hat\beta) - \beta_0\bigr)^2 + \mathrm{Var}(\hat\beta).$$

Thus, when we use MSE to evaluate the accuracy of an estimator, we are
choosing to give equal weight to random errors and to systematic errors that
arise from bias.¹

¹ For a scalar parameter, it is common to report the square root of the MSE,
called the root mean squared error, or RMSE, instead of the MSE itself.

From (3.60), we can see that

$$\hat\beta - \beta_0 = (X^\top X)^{-1} X^\top Z\gamma_0 + (X^\top X)^{-1} X^\top u.$$

Therefore, $\hat\beta - \beta_0$ times itself transposed is equal to

$$\begin{aligned}
&(X^\top X)^{-1} X^\top Z\gamma_0\gamma_0^\top Z^\top X (X^\top X)^{-1} + (X^\top X)^{-1} X^\top uu^\top X (X^\top X)^{-1} \\
&\quad + (X^\top X)^{-1} X^\top Z\gamma_0 u^\top X (X^\top X)^{-1} + (X^\top X)^{-1} X^\top u\gamma_0^\top Z^\top X (X^\top X)^{-1}.
\end{aligned}$$

The second term here has expectation $\sigma_0^2 (X^\top X)^{-1}$, and the third and fourth
terms, one of which is the transpose of the other, have expectation zero. Thus
we conclude that

$$\mathrm{MSE}(\hat\beta) = \sigma_0^2 (X^\top X)^{-1} + (X^\top X)^{-1} X^\top Z\gamma_0\gamma_0^\top Z^\top X (X^\top X)^{-1}. \tag{3.62}$$

The first term is what the covariance matrix would be if we were estimating
a correctly specified model, and the second term arises from the bias of $\hat\beta$.
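
To see how the two terms of (3.62) trade off against the variance (3.56) of the unrestricted estimator, it may help to evaluate both expressions for a single regressor. The sketch below assumes NumPy and a known $\sigma_0^2$; the design, the two values of $\gamma_0$, and the function names are hypothetical choices made so that the omitted regressor is highly collinear with $x$.

```python
import numpy as np

def mse_restricted(X, Z, gamma0, sigma0_sq):
    """MSE matrix (3.62) of the restricted estimator when Z is wrongly omitted."""
    XtX_inv = np.linalg.inv(X.T @ X)
    bias = XtX_inv @ X.T @ Z @ gamma0
    return sigma0_sq * XtX_inv + np.outer(bias, bias)

def var_unrestricted(X, Z, sigma0_sq):
    """Covariance matrix (3.56) of the unrestricted estimator."""
    n = Z.shape[0]
    MZ = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    return sigma0_sq * np.linalg.inv(X.T @ MZ @ X)

rng = np.random.default_rng(5)              # arbitrary seed
n = 25
x = rng.normal(size=n)
X = x.reshape(-1, 1)                        # single regressor, scalar beta
Z = (0.9 * x + 0.3 * rng.normal(size=n)).reshape(-1, 1)   # collinear with x

v_tilde = var_unrestricted(X, Z, 1.0)[0, 0]
mse_small = mse_restricted(X, Z, np.array([0.05]), 1.0)[0, 0]   # small gamma0
mse_large = mse_restricted(X, Z, np.array([2.0]), 1.0)[0, 0]    # large gamma0
```

With a small $\gamma_0$ the restricted estimator has the smaller MSE in this design, while with a large $\gamma_0$ the squared bias dominates and the ranking is reversed.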


We would like to compare $\mathrm{MSE}(\hat\beta)$, expression (3.62), with $\mathrm{MSE}(\tilde\beta) = \mathrm{Var}(\tilde\beta)$,
which is given by expression (3.56). However, no unambiguous comparison
is possible. The first term in (3.62) cannot be larger, in the matrix sense,
than (3.56). Thus, if the bias is small, the second term will be small, and
it may well be that $\hat\beta$ is more efficient than $\tilde\beta$. However, if the bias is large,
the second term will necessarily be large, and $\hat\beta$ will be less efficient than $\tilde\beta$.
Of course, it is quite possible that some parameters may be estimated more
efficiently by $\hat\beta$ and others more efficiently by $\tilde\beta$.


Whether or not $\hat\beta$ happens to be more efficient than $\tilde\beta$, the covariance matrix
for $\hat\beta$ that will be calculated by a least squares regression program will be
incorrect. The program will attempt to estimate the first term in (3.62),
but it will ignore the second. However, $s^2$ will typically be larger than $\sigma_0^2$ if
some regressors have been incorrectly omitted. Thus, the program will yield
a biased estimate of the first term.


It is tempting to conclude from this discussion that underspecification is a 
much more severe problem than overspecification. After all, the former con- 
stitutes misspecification, but the latter does not. In consequence, as we have 
seen, underspecification leads to biased estimates and an estimated covariance 
matrix that may be severely misleading, while overspecification merely leads 
to inefficiency. Therefore, it would seem that we should always err on the 
side of overspecification. If all samples were extremely large, this might be 
a reasonable conclusion. The bias caused by underspecification does not go 
away as the sample size increases, but the variances of all consistent estima- 
tors tend to zero. Therefore, in sufficiently large samples, it makes sense to 
avoid underspecification at all costs. However, in samples of modest size, the 
gain in efficiency from omitting some variables, even if their coefficients are 
not actually zero, may be very large relative to the bias that is caused by their 
omission. 


3.8 Measures of Goodness of Fit 


A natural question to ask about any regression is: How well does it fit? There
is more than one way to answer this question, and none of the answers may
be entirely satisfactory in every case.

One possibility might be to use $s$, the estimated standard error of the regression.
But $s$ can be rather hard to interpret, since it depends on the scale of
the $y_t$. When the regressand is in logarithms, however, $s$ is meaningful and
easy to interpret. Consider the loglinear model

$$\log y_t = \beta_1 + \beta_2 \log X_{t2} + \beta_3 \log X_{t3} + u_t. \tag{3.63}$$

As we saw in Section 1.3, this model can be obtained by taking logarithms of
both sides of the model

$$y_t = e^{\beta_1} X_{t2}^{\beta_2} X_{t3}^{\beta_3} e^{u_t}. \tag{3.64}$$

The error factor $e^{u_t}$ is, for $u_t$ small, approximately equal to $1 + u_t$. Thus the
standard deviation of $u_t$ in (3.63) is, approximately, the standard deviation of
the proportional error in the regression (3.64). Therefore, for any regression
where the dependent variable is in logs, we can simply interpret $100s$, provided
it is small, as an estimate of the percentage error in the regression.


When the regressand is not in logarithms, we could divide $s$ by $\bar y$, the average
of the $y_t$, or perhaps by the average absolute value of $y_t$ if they were not all
of the same sign. This would provide a measure of how large the errors in
the regression are relative to the magnitude of the dependent variable. In many
cases, $s/\bar y$ (for a model in levels) or $s$ (for a model in logarithms) will provide
a useful measure of how well a regression fits. However, these measures are
not entirely satisfactory. They are bounded from below, since they cannot be
negative, but they are not bounded from above. Moreover, $s/\bar y$ is very hard
to interpret if $y_t$ can be either positive or negative.


A much more commonly used (and misused) measure of goodness of fit is
the coefficient of determination, or $R^2$, which we introduced in Section 2.5.
In that section, we discussed two versions of $R^2$: the centered version, $R_c^2$,
and the uncentered version, $R_u^2$. As we saw there, both versions are based
on Pythagoras’ Theorem, which allows the total sum of squares (TSS) to be
broken into two parts, the explained sum of squares (ESS) and the sum of
squared residuals (SSR). Both versions of $R^2$ can be written as

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}},$$

where ESS and TSS are calculated around zero for $R_u^2$ and around the mean of
the regressand for $R_c^2$. The centered version is much more commonly encountered
than the uncentered version, because it is invariant to changes in the
mean of the regressand. By adding a large enough constant to all the $y_t$, we
could always make $R_u^2$ become arbitrarily close to 1, at least if the regression
included a constant, since the SSR would stay the same and the TSS would
increase without limit. We discussed an example of this in Section 2.5.


One important limitation of both versions of $R^2$ is that they are valid only
if a regression model is estimated by least squares, since otherwise it will not
be true that TSS = ESS + SSR. Moreover, as we saw in Section 2.5, the
centered version is not valid if the regressors do not include a constant term
or the equivalent, that is, if $\iota$, the vector of 1s, does not belong to $\mathcal{S}(X)$.


Another, possibly undesirable, feature of both $R_u^2$ and $R_c^2$ as measures of
goodness of fit is that both increase, or at least do not decrease, whenever more
regressors are added. To
demonstrate this, we argue in terms of $R_u^2$, but the FWL Theorem can be
used to show that the same results hold for $R_c^2$. Consider once more the
restricted and unrestricted models, (3.55) and (3.51), respectively. Since both
regressions have the same dependent variable, they have the same TSS. Thus
the regression with the larger ESS will also have the larger $R^2$. The ESS from
(3.51) is $\|P_{X,Z}\,y\|^2$ and that from (3.55) is $\|P_X y\|^2$, and so the difference
between them is

$$y^\top (P_{X,Z} - P_X)\, y. \tag{3.65}$$

Clearly, $\mathcal{S}(X) \subseteq \mathcal{S}(X, Z)$. Thus $P_X$ projects on to a subspace of the image
of $P_{X,Z}$. This implies that the matrix in the middle of (3.65), say $Q$, is an
orthogonal projection matrix; see Exercise 2.17. Consequently, (3.65) takes
the form $y^\top Q y = \|Qy\|^2 \ge 0$. The ESS from (3.51) is therefore no less than
that from (3.55), and so the $R^2$ from (3.51) is no less than that from (3.55).
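
The argument based on (3.65) is easy to illustrate. In the sketch below, which assumes NumPy, the added regressors are pure noise generated independently of $y$; the data, the seed, and the function name `ess_uncentered` are hypothetical.

```python
import numpy as np

def ess_uncentered(y, X):
    """Explained sum of squares around zero: ||P_X y||^2."""
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(fitted @ fitted)

rng = np.random.default_rng(6)              # arbitrary seed
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 3))                 # pure noise, unrelated to y
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

tss = float(y @ y)                          # TSS around zero
r2_u_restricted = ess_uncentered(y, X) / tss
r2_u_unrestricted = ess_uncentered(y, np.column_stack([X, Z])) / tss
```

Even though the columns of $Z$ have nothing to do with $y$, the uncentered $R^2$ from the larger regression is no smaller than the one from the restricted regression.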


The $R^2$ can be modified so that adding regressors does not necessarily
increase its value. If $\iota \in \mathcal{S}(X)$, the centered $R^2$ can be written as

$$R_c^2 = 1 - \frac{\sum_{t=1}^{n} \hat u_t^2}{\sum_{t=1}^{n} (y_t - \bar y)^2}. \tag{3.66}$$

The numerator of the second term is just the SSR which, as we saw in Section 3.6,
has expectation $(n - k)\sigma_0^2$ under standard assumptions. The denominator
can be thought of as an estimator of $n$ times the variance of $y_t$ about
its true mean. As such, it will have expectation $(n - 1)\mathrm{Var}(y)$. Thus the
second term of (3.66) can be thought of as the ratio of two biased estimators.
If we replace these biased estimators by unbiased estimators, we obtain the
adjusted $R^2$,

$$\bar R^2 \equiv 1 - \frac{\frac{1}{n-k}\sum_{t=1}^{n} \hat u_t^2}{\frac{1}{n-1}\sum_{t=1}^{n} (y_t - \bar y)^2} = 1 - \frac{(n-1)\, y^\top M_X y}{(n-k)\, y^\top M_\iota y}. \tag{3.67}$$


The adjusted $R^2$ is reported by virtually all regression packages, often in
preference to $R_c^2$. However, $\bar R^2$ is really no more informative than $R_c^2$. The
two will generally be very similar, except when $(n - k)/(n - 1)$ is noticeably
less than 1.

One nice feature of $R_u^2$ and $R_c^2$ is that they are constrained to lie between 0
and 1. In contrast, $\bar R^2$ can actually be negative. If a model has very little
explanatory power, it is conceivable that $(n - 1)/(n - k)$ may be greater than
TSS/SSR. When that happens, $\bar R^2 < 0$.

The widespread use of $\bar R^2$ dates from the early days of econometrics, when
sample sizes were often small, and investigators were easily impressed by models
that yielded large values of $R_c^2$. We saw above that adding an extra regressor
to a linear regression will always increase $R_c^2$. This increase can be quite
noticeable when the sample size is small, even if the added regressor does not
really belong in the regression. In contrast, adding an extra regressor will
increase $\bar R^2$ only if the proportional reduction in the SSR is greater than the
proportional reduction in $n - k$. Therefore, a naive investigator who tries to
maximize $\bar R^2$ is less likely to end up choosing a severely overspecified model
than one who tries to maximize $R_c^2$.
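
A small numerical example shows how $\bar R^2$ can fall below zero. The sketch assumes NumPy; the six observations and the function name `r_squared` are hypothetical, and the regressor is constructed to be orthogonal to the demeaned regressand, so that SSR = TSS.

```python
import numpy as np

def r_squared(y, X):
    """Return (centered R^2, adjusted R^2); X is assumed to contain a constant."""
    n, k = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    r2_c = 1.0 - ssr / tss                            # (3.66)
    r2_adj = 1.0 - (n - 1) / (n - k) * ssr / tss      # (3.67)
    return r2_c, r2_adj

# Hypothetical data: x has no explanatory power at all for y.
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
x = np.array([1.0, 1.0, -1.0, -1.0, 0.0, 0.0])
X = np.column_stack([np.ones(6), x])
r2_c, r2_adj = r_squared(y, X)
```

Here $R_c^2 = 0$, while $\bar R^2 = 1 - 5/4 = -0.25$, because $(n-1)/(n-k) = 5/4$ exceeds TSS/SSR $= 1$.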


It can be extremely misleading to compare any form of $R^2$ for models estimated
using different data sets. Suppose, for example, that we estimate
Model 1 using a set of data for which the regressors, and consequently the
regressand, vary a lot, and we estimate Model 2 using a second set of data for
which both the regressors and the regressand vary much less. Then, even if
both models fit equally well, in the sense that their residuals have just about
the same variance, Model 1 will have a much larger $R^2$ than Model 2. This
can most easily be seen from (3.66). Increasing the denominator of the second
term while holding the numerator constant will evidently increase the $R^2$.


3.9 Final Remarks 


In this chapter, we have dealt with many of the most fundamental, and best-known,
statistical properties of ordinary least squares. In particular, we have
discussed the properties of $\hat\beta$ as an estimator of $\beta$ and of $s^2$ as an estimator
of $\sigma_0^2$. We have also derived $\mathrm{Var}(\hat\beta)$, the covariance matrix of $\hat\beta$, and shown
how to estimate it. However, we have not said anything about how to use $\hat\beta$
and the estimate of $\mathrm{Var}(\hat\beta)$ to make inferences about $\beta$. This important topic
will be taken up in the next chapter.


3.10 Exercises 


3.1 Generate a sample of size 25 from the model (3.11), with $\beta_1 = 1$ and $\beta_2 = 0.8$.
For simplicity, assume that $y_0 = 0$ and that the $u_t$ are NID(0,1). Use this
sample to compute the OLS estimates $\hat\beta_1$ and $\hat\beta_2$. Repeat at least 100 times,
and find the averages of the $\hat\beta_1$ and the $\hat\beta_2$. Use these averages to estimate
the bias of the OLS estimators of $\beta_1$ and $\beta_2$.

Repeat this exercise for sample sizes of 50, 100, and 200. What happens to
the bias of $\hat\beta_1$ and $\hat\beta_2$ as the sample size is increased?

3.2 Consider a sequence of random variables $x_t$, $t = 1, \ldots, \infty$, such that $E(x_t) = \mu_t$.
By considering the centered variables $x_t - \mu_t$, show that the law of large
numbers can be formulated as

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} x_t = \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \mu_t.$$
3.3 Using the data on consumption and personal disposable income in Canada for
the period 1947:1 to 1996:4 in the file consumption.data, estimate the model

$$c_t = \beta_1 + \beta_2 y_t + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2),$$

where $c_t = \log C_t$ is the log of consumption and $y_t = \log Y_t$ is the log of
disposable income, for the entire sample period. Then use the estimates of
$\beta_1$, $\beta_2$, and $\sigma$ to obtain 200 simulated observations on $c_t$.

Begin by regressing your simulated log consumption variable on the log of
income and a constant using just the first 3 observations. Save the estimates
of $\beta_1$, $\beta_2$, and $\sigma$. Repeat this exercise for sample sizes of $4, 5, \ldots, 200$. Plot
your estimates of $\beta_2$ and $\sigma$ as a function of the sample size. What happens
to these estimates as the sample size grows?

Repeat the complete exercise with a different set of simulated consumption
data. Which features of the paths of the parameter estimates are common to
the two experiments, and which are different?

3.4 Plot the EDF (empirical distribution function) of the residuals from OLS
estimation using one of the sets of simulated data, for the entire sample period,
that you obtained in the last exercise; see Exercise 1.1 for a definition of the
EDF. On the same graph, plot the CDF of the $N(0, \sigma^2)$ distribution, where
$\sigma^2$ now denotes the variance you used to simulate the log of consumption.

Show that the distributions characterized by the EDF and the normal CDF
have the same mean but different variances. How could you modify the residuals
so that the EDF of the modified residuals would have the same variance,
$\sigma^2$, as the normal CDF?

3.5 In Section 3.4, it is stated that the covariance matrix $\mathrm{Var}(b)$ of any random
$k$-vector $b$ is positive semidefinite. Prove this fact by considering arbitrary
linear combinations $w^\top b$ of the components of $b$ with nonrandom $w$. If
$\mathrm{Var}(b)$ is positive semidefinite without being positive definite, what can you
say about $b$?

3.6 For any pair of random variables, $b_1$ and $b_2$, show, by using the fact that the
covariance matrix of $b \equiv [b_1 \;\vdots\; b_2]$ is positive semidefinite, that

$$\bigl(\mathrm{Cov}(b_1, b_2)\bigr)^2 \le \mathrm{Var}(b_1)\,\mathrm{Var}(b_2).$$

Use this result to show that the correlation of $b_1$ and $b_2$ lies between $-1$ and 1.

3.7 If $A$ is a positive definite matrix, show that $A^{-1}$ is also positive definite.

3.8 If $A$ is a symmetric positive definite $k \times k$ matrix, then $I - A$ is positive
definite if and only if $A^{-1} - I$ is positive definite, where $I$ is the $k \times k$ identity
matrix. Prove this result by considering the quadratic form $x^\top (I - A)x$ and
expressing $x$ as $R^{-1}z$, where $R$ is a symmetric matrix such that $A = R^2$.

Extend the above result to show that, if $A$ and $B$ are symmetric positive
definite matrices of the same dimensions, then $A - B$ is positive definite if
and only if $B^{-1} - A^{-1}$ is positive definite.
3.9 Show that the variance of a sum of random variables $z_t$, $t = 1, \ldots, n$, with
$\mathrm{Cov}(z_t, z_s) = 0$ for $t \ne s$, equals the sum of their individual variances, whatever
their expectations may be.

3.10 If $\gamma \equiv w^\top\beta = \sum_{i=1}^{k} w_i\beta_i$, show that $\mathrm{Var}(\hat\gamma)$, which is given by (3.33), can
also be written as

$$\sum_{i=1}^{k} w_i^2\, \mathrm{Var}(\hat\beta_i) + 2\sum_{i=2}^{k}\sum_{j=1}^{i-1} w_i w_j\, \mathrm{Cov}(\hat\beta_i, \hat\beta_j). \tag{3.68}$$

3.11 Using the data in the file consumption.data, construct the variables $c_t$, the
logarithm of consumption, and $y_t$, the logarithm of income, and their first
differences $\Delta c_t = c_t - c_{t-1}$ and $\Delta y_t = y_t - y_{t-1}$. Use these data to estimate
the following model for the period 1953:1 to 1996:4:

$$\Delta c_t = \beta_1 + \beta_2\Delta y_t + \beta_3\Delta y_{t-1} + \beta_4\Delta y_{t-2} + \beta_5\Delta y_{t-3} + \beta_6\Delta y_{t-4}. \tag{3.69}$$

Let $\gamma = \sum_{i=2}^{6} \beta_i$. Calculate $\hat\gamma$ and its standard error in two different ways.
One method should explicitly use the result (3.33), and the other should use
a transformation of regression (3.69) which allows $\hat\gamma$ and its standard error to
be read off directly from the regression output.

3.12 Starting from equation (3.42) and using the result proved in Exercise 3.9, but
without using (3.43), prove that, if $E(u_t^2) = \sigma_0^2$ and $E(u_s u_t) = 0$ for all $s \ne t$,
then $\mathrm{Var}(\hat u_t) = (1 - h_t)\sigma_0^2$. This is the result (3.44).

3.13 Use the result (3.44) to show that the MM estimator $\hat\sigma^2$ of (3.46) is consistent.
You may assume that a LLN applies to the average in that equation.

3.14 Prove that $E(\hat u^\top \hat u) = (n - k)\sigma_0^2$. This is the result (3.48). The proof should
make use of the fact that the trace of a product of matrices is invariant to
cyclic permutations; see Section 2.6.

3.15 Consider two linear regressions, one restricted and the other unrestricted:

$$y = X\beta + u \quad\text{and}\quad y = X\beta + Z\gamma + u.$$

Show that, in the case of mutually orthogonal regressors, with $X^\top Z = O$,
the estimates of $\beta$ from the two regressions are identical.

3.16 Suppose that you use the OLS estimates $\hat\beta$, obtained by regressing the $n \times 1$
vector $y$ on the $n \times k$ matrix $X$, to forecast the $n_* \times 1$ vector $y_*$ using the
$n_* \times k$ matrix $X_*$. Assuming that the error terms, both within the sample
used to estimate the parameters $\beta$ and outside the sample in the forecast
period, are IID$(0, \sigma^2)$, and that the model is correctly specified, what is the
covariance matrix of the vector of forecast errors?

3.17 The class of estimators considered by the Gauss-Markov Theorem can be
written as $\tilde\beta = Ay$, with $AX = I$. Show that this class of estimators is in
fact identical to the class of MM estimators of the form

$$\tilde\beta = (W^\top X)^{-1} W^\top y,$$

where $W$ is a matrix of exogenous variables such that $W^\top X$ is nonsingular.
3.18 Show that the difference between the unrestricted estimator $\tilde\beta$ of model (3.51)
and the restricted estimator $\hat\beta$ of model (3.55) is given by

$$\tilde\beta - \hat\beta = (X^\top M_Z X)^{-1} X^\top M_Z M_X y.$$

Hint: In order to prove this result, it is easiest to premultiply the difference
by $X^\top M_Z X$.

3.19 Consider the linear regression model

$$y_t = \beta_1 + \beta_2 X_{t2} + \beta_3 X_{t3} + u_t.$$

Explain how you could estimate this model subject to the restriction that
$\beta_2 + \beta_3 = 1$ by running a regression that imposes the restriction. Also,
explain how you could estimate the unrestricted model in such a way that the
value of one of the coefficients would be zero if the restriction held exactly for
your data.

3.20 Prove that, for a linear regression model with a constant term, the uncentered
$R_u^2$ is always greater than the centered $R_c^2$.

3.21 Consider a linear regression model for a dependent variable $y_t$ that has a
sample mean of 17.21. Suppose that we create a new variable $y_t' = y_t + 10$
and run the same linear regression using $y_t'$ instead of $y_t$ as the regressand.
How will $R_c^2$, $R_u^2$, and the estimate of the constant term be related in the two
regressions? What if instead $y_t' = y_t - 10$?

3.22 Using the data in the file consumption.data, construct the variables $c_t$, the
logarithm of consumption, and $y_t$, the logarithm of income. Use them to estimate,
for the period 1953:1 to 1996:4, the following autoregressive distributed
lag, or ADL, model:

$$c_t = \alpha + \beta c_{t-1} + \gamma_0 y_t + \gamma_1 y_{t-1} + u_t. \tag{3.70}$$

Such models are often expressed in first-difference form, that is, as

$$\Delta c_t = \delta + \phi c_{t-1} + \theta\Delta y_t + \psi y_{t-1} + u_t, \tag{3.71}$$

where the first-difference operator $\Delta$ is defined so that $\Delta c_t = c_t - c_{t-1}$.
Estimate the first-difference model (3.71), and then, without using the results
of (3.70), rederive the estimates of $\alpha$, $\beta$, $\gamma_0$, and $\gamma_1$ solely on the basis of your
results from (3.71).

3.23 Simulate model (3.70) of the previous question, using your estimates of $\alpha$, $\beta$,
$\gamma_0$, $\gamma_1$, and the error variance $\sigma^2$. Perform the simulation conditional on the
income series and the first observation $c_1$ of consumption. Plot the residuals
from running (3.70) on the simulated data, and compare the plot with that
of the residuals from the real data. Comments?
Chapter 4


Hypothesis Testing in 
Linear Regression Models 


4.1 Introduction 


As we saw in Chapter 3, the vector of OLS parameter estimates $\hat\beta$ is a random
vector. Since it would be an astonishing coincidence if $\hat\beta$ were equal to the
true parameter vector $\beta_0$ in any finite sample, we must take the randomness
of $\hat\beta$ into account if we are to make inferences about $\beta$. In classical econometrics,
the two principal ways of doing this are performing hypothesis tests and
constructing confidence intervals or, more generally, confidence regions. We 
will discuss the first of these topics in this chapter, as the title implies, and the 
second in the next chapter. Hypothesis testing is easier to understand than 
the construction of confidence intervals, and it plays a larger role in applied 
econometrics. 


In the next section, we develop the fundamental ideas of hypothesis testing 
in the context of a very simple special case. Then, in Section 4.3, we review 
some of the properties of several distributions which are related to the nor- 
mal distribution and are commonly encountered in the context of hypothesis 
testing. We will need this material for Section 4.4, in which we develop a 
number of results about hypothesis tests in the classical normal linear model. 
In Section 4.5, we relax some of the assumptions of that model and introduce 
large-sample tests. An alternative approach to testing under relatively weak 
assumptions is bootstrap testing, which we introduce in Section 4.6. Finally, 
in Section 4.7, we discuss what determines the ability of a test to reject a 
hypothesis that is false. 


4.2 Basic Ideas


The very simplest sort of hypothesis test concerns the (population) mean from
which a random sample has been drawn. To test such a hypothesis, we may
assume that the data are generated by the regression model

$$y_t = \beta + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2), \tag{4.01}$$

where $y_t$ is an observation on the dependent variable, $\beta$ is the population
mean, which is the only parameter of the regression function, and $\sigma^2$ is the
variance of the error term $u_t$. The least squares estimator of $\beta$ and its variance,
for a sample of size $n$, are given by

$$\hat\beta = \frac{1}{n}\sum_{t=1}^{n} y_t \quad\text{and}\quad \mathrm{Var}(\hat\beta) = \frac{1}{n}\sigma^2. \tag{4.02}$$

These formulas can either be obtained from first principles or as special cases
of the general results for OLS estimation. In this case, $X$ is just an $n$-vector
of 1s. Thus, for the model (4.01), the standard formulas $\hat\beta = (X^\top X)^{-1} X^\top y$
and $\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}$ yield the two formulas given in (4.02).
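
The special case is simple enough to verify directly. The following sketch assumes NumPy; the five observations and the value of $\sigma^2$ are hypothetical. It applies the general OLS formulas with $X$ equal to a vector of 1s.

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 6.0, 9.0])     # hypothetical sample
n = y.size
X = np.ones((n, 1))                         # the only regressor is the constant
sigma_sq = 4.0                              # assumed known error variance

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)[0]           # (X'X)^{-1} X'y
var_beta_hat = sigma_sq * np.linalg.inv(X.T @ X)[0, 0]    # sigma^2 (X'X)^{-1}
```

Since $X^\top X = n$ and $X^\top y = \sum_t y_t$, the first expression is the sample mean and the second is $\sigma^2/n$, as in (4.02).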


Now suppose that we wish to test the hypothesis that $\beta = \beta_0$, where $\beta_0$ is
some specified value of $\beta$.$^1$ The hypothesis that we are testing is called the
null hypothesis. It is often given the label $H_0$ for short. In order to test $H_0$,
we must calculate a test statistic, which is a random variable that has a known
distribution when the null hypothesis is true and some other distribution when
the null hypothesis is false. If the value of this test statistic is one that might
frequently be encountered by chance under the null hypothesis, then the test
provides no evidence against the null. On the other hand, if the value of the
test statistic is an extreme one that would rarely be encountered by chance
under the null, then the test does provide evidence against the null. If this
evidence is sufficiently convincing, we may decide to reject the null hypothesis
that $\beta = \beta_0$.


For the moment, we will restrict the model (4.01) by making two very strong
assumptions. The first is that $u_t$ is normally distributed, and the second
is that $\sigma$ is known. Under these assumptions, a test of the hypothesis that
$\beta = \beta_0$ can be based on the test statistic

$$z = \frac{\hat\beta - \beta_0}{\bigl(\mathrm{Var}(\hat\beta)\bigr)^{1/2}} = \frac{n^{1/2}}{\sigma}(\hat\beta - \beta_0). \tag{4.03}$$


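It turns out that, under the null hypothesis, $z$ must be distributed as $N(0,1)$.
It must have mean 0 because $\hat\beta$ is an unbiased estimator of $\beta$, and $\beta = \beta_0$
under the null. It must have variance unity because, by (4.02),

$$E(z^2) = \frac{n}{\sigma^2}\,E\bigl((\hat\beta - \beta_0)^2\bigr) = \frac{n}{\sigma^2}\,\frac{\sigma^2}{n} = 1.$$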


1 It may be slightly confusing that a 0 subscript is used here to denote the value
of a parameter under the null hypothesis as well as its true value. So long
as it is assumed that the null hypothesis is true, however, there should be no
possible confusion.

Finally, to see that $z$ must be normally distributed, note that $\hat\beta$ is just the
average of the $y_t$, each of which must be normally distributed if the
corresponding $u_t$ is; see Exercise 1.7. As we will see in the next section, this
implies that $z$ is also normally distributed. Thus $z$ has the first property that
we would like a test statistic to possess: It has a known distribution under
the null hypothesis.
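The statistic (4.03) is simple to compute once sigma is treated as known. The sketch below is only an illustration: the function name `z_statistic`, the data, the null value, and sigma are all assumptions made for the example.

```python
import math

def z_statistic(y, beta0, sigma):
    """Return z = sqrt(n) * (beta_hat - beta0) / sigma, as in (4.03)."""
    n = len(y)
    beta_hat = sum(y) / n            # least squares estimate: the sample mean
    return math.sqrt(n) * (beta_hat - beta0) / sigma

# Hypothetical data, null value, and known sigma, chosen only for illustration.
y = [2.1, 1.7, 2.6, 1.9, 2.2]
z = z_statistic(y, beta0=2.0, sigma=0.5)
print(round(z, 4))
```

With these made-up numbers the sample mean is 2.1, so z is sqrt(5) times 0.1 divided by 0.5, a value well inside the central part of the N(0,1) distribution.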


For every null hypothesis there is, at least implicitly, an alternative hypothesis,
which is often given the label $H_1$. The alternative hypothesis is what we are
testing the null against, in this case the model (4.01) with $\beta \neq \beta_0$. Just as
important as the fact that $z$ follows the $N(0,1)$ distribution under the null is
the fact that $z$ does not follow this distribution under the alternative. Suppose
that $\beta$ takes on some other value, say $\beta_1$. Then it is clear that $\hat\beta = \beta_1 + \hat\gamma$,
where $\hat\gamma$ has mean 0 and variance $\sigma^2/n$; recall equation (3.05). In fact, $\hat\gamma$
is normal under our assumption that the $u_t$ are normal, just like $\hat\beta$, and so
$\hat\gamma \sim N(0, \sigma^2/n)$. It follows that $z$ is also normal (see Exercise 1.7 again), and
we find from (4.03) that

$$z \sim N(\lambda, 1), \quad\text{with}\quad \lambda = \frac{n^{1/2}}{\sigma}(\beta_1 - \beta_0). \tag{4.04}$$


Therefore, provided $n$ is sufficiently large, we would expect the mean of $z$ to
be large and positive if $\beta_1 > \beta_0$ and large and negative if $\beta_1 < \beta_0$. Thus we
will reject the null hypothesis whenever $z$ is sufficiently far from 0. Just how
we can decide what “sufficiently far” means will be discussed shortly.


Since we want to test the null that $\beta = \beta_0$ against the alternative that $\beta \neq \beta_0$,
we must perform a two-tailed test and reject the null whenever the absolute
value of $z$ is sufficiently large. If instead we were interested in testing the
null hypothesis that $\beta \le \beta_0$ against the alternative that $\beta > \beta_0$, we would
perform a one-tailed test and reject the null whenever $z$ was sufficiently large
and positive. In general, tests of equality restrictions are two-tailed tests, and
tests of inequality restrictions are one-tailed tests.


Since z is a random variable that can, in principle, take on any value on the 
real line, no value of z is absolutely incompatible with the null hypothesis, 
and so we can never be absolutely certain that the null hypothesis is false. 
One way to deal with this situation is to decide in advance on a rejection rule, 
according to which we will choose to reject the null hypothesis if and only if 
the value of z falls into the rejection region of the rule. For two-tailed tests, 
the appropriate rejection region is the union of two sets, one containing all 
values of z greater than some positive value, the other all values of z less than 
some negative value. For a one-tailed test, the rejection region would consist 
of just one set, containing either sufficiently positive or sufficiently negative 
values of z, according to the sign of the inequality we wish to test. 


A test statistic combined with a rejection rule is sometimes called simply a
test. If the test incorrectly leads us to reject a null hypothesis that is true,
we are said to make a Type I error. The probability of making such an error
is, by construction, the probability, under the null hypothesis, that $z$ falls
into the rejection region. This probability is sometimes called the level of
significance, or just the level, of the test. A common notation for this is $\alpha$.
Like all probabilities, $\alpha$ is a number between 0 and 1, although, in practice, it
is generally much closer to 0 than 1. Popular values of $\alpha$ include .05 and .01.
If the observed value of $z$, say $\hat z$, lies in a rejection region associated with a
probability under the null of $\alpha$, we will reject the null hypothesis at level $\alpha$;
otherwise, we will not reject the null hypothesis. In this way, we ensure that
the probability of making a Type I error is precisely $\alpha$.


In the previous paragraph, we implicitly assumed that the distribution of the 
test statistic under the null hypothesis is known exactly, so that we have what 
is called an exact test. In econometrics, however, the distribution of a test 
statistic is often known only approximately. In this case, we need to draw a 
distinction between the nominal level of the test, that is, the probability of 
making a Type I error according to whatever approximate distribution we are 
using to determine the rejection region, and the actual rejection probability, 
which may differ greatly from the nominal level. The rejection probability is 
generally unknowable in practice, because it typically depends on unknown 
features of the DGP.$^2$


The probability that a test will reject the null is called the power of the test.
If the data are generated by a DGP that satisfies the null hypothesis, the
power of an exact test is equal to its level. In general, power will depend on
precisely how the data were generated and on the sample size. We can see
from (4.04) that the distribution of $z$ is entirely determined by the value of $\lambda$,
with $\lambda = 0$ under the null, and that the value of $\lambda$ depends on the parameters
of the DGP. In this example, $\lambda$ is proportional to $\beta_1 - \beta_0$ and to the square
root of the sample size, and it is inversely proportional to $\sigma$.


Values of $\lambda$ different from 0 move the probability mass of the $N(\lambda, 1)$
distribution away from the center of the $N(0,1)$ distribution and into its tails. This
can be seen in Figure 4.1, which graphs the $N(0,1)$ density and the $N(\lambda, 1)$
density for $\lambda = 2$. The second density places much more probability than the
first on values of $z$ greater than 2. Thus, if the rejection region for our test
was the interval from 2 to $+\infty$, there would be a much higher probability in
that region for $\lambda = 2$ than for $\lambda = 0$. Therefore, we would reject the null
hypothesis more often when the null hypothesis is false, with $\lambda = 2$, than
when it is true, with $\lambda = 0$.
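The comparison just described can be made concrete with a short calculation. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `rejection_probability` is ours, and the rejection region (2, +infinity) is the one used in the text.

```python
from statistics import NormalDist

def rejection_probability(lam, cutoff=2.0):
    """Pr(z > cutoff) when z is distributed as N(lam, 1)."""
    return 1.0 - NormalDist(mu=lam, sigma=1.0).cdf(cutoff)

p_null = rejection_probability(0.0)   # lambda = 0: the null is true
p_alt = rejection_probability(2.0)    # lambda = 2: the null is false

print(round(p_null, 4), round(p_alt, 4))
```

Under the null the probability is the small upper-tail area of the N(0,1) density beyond 2, whereas for lambda equal to 2 the cutoff sits at the center of the density, so that half the probability mass lies in the rejection region.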


2 Another term that often arises in the discussion of hypothesis testing is the size 
of a test. Technically, this is the supremum of the rejection probability over all 
DGPs that satisfy the null hypothesis. For an exact test, the size equals the 
level. For an approximate test, the size is typically difficult or impossible to 
calculate. It is often, but by no means always, greater than the nominal level 
of the test.

Figure 4.1 The normal distribution centered and uncentered


Mistakenly failing to reject a false null hypothesis is called making a Type II
error. The probability of making such a mistake is equal to 1 minus the
power of the test. It is not hard to see that, quite generally, the probability of
rejecting the null with a two-tailed test based on $z$ increases with the absolute
value of $\lambda$. Consequently, the power of such a test will increase as $|\beta_1 - \beta_0|$
increases, as $\sigma$ decreases, and as the sample size increases. We will discuss
what determines the power of a test in more detail in Section 4.7.


In order to construct the rejection region for a test at level $\alpha$, the first step
is to calculate the critical value associated with the level $\alpha$. For a two-tailed
test based on any test statistic that is distributed as $N(0,1)$, including the
statistic $z$ defined in (4.03), the critical value $c_\alpha$ is defined implicitly by

$$\Phi(c_\alpha) = 1 - \alpha/2. \tag{4.05}$$

Recall that $\Phi$ denotes the CDF of the standard normal distribution. In terms
of the inverse function $\Phi^{-1}$, $c_\alpha$ can be defined explicitly by the formula

$$c_\alpha = \Phi^{-1}(1 - \alpha/2). \tag{4.06}$$

According to (4.05), the probability that $z > c_\alpha$ is $1 - (1 - \alpha/2) = \alpha/2$, and
the probability that $z < -c_\alpha$ is also $\alpha/2$, by symmetry. Thus the probability
that $|z| > c_\alpha$ is $\alpha$, and so an appropriate rejection region for a test at level $\alpha$
is the set defined by $|z| > c_\alpha$. Clearly, $c_\alpha$ increases as $\alpha$ approaches 0. As
an example, when $\alpha = .05$, we see from (4.06) that the critical value for a
two-tailed test is $\Phi^{-1}(.975) = 1.96$. We would reject the null at the .05 level
whenever the observed absolute value of the test statistic exceeds 1.96.
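Formula (4.06) can be evaluated directly with any routine for the inverse of the standard normal CDF. The sketch below uses `NormalDist().inv_cdf` from the Python standard library; the function name `critical_value` and the particular levels are chosen for illustration.

```python
from statistics import NormalDist

def critical_value(alpha):
    """Two-tailed critical value c_alpha = Phi^{-1}(1 - alpha/2), as in (4.06)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(critical_value(alpha), 3))
```

The loop makes visible the point made above: the critical value grows as the level shrinks toward 0.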


P Values 


As we have defined it, the result of a test is yes or no: Reject or do not
reject. A more sophisticated approach to deciding whether or not to reject
the null hypothesis is to calculate the P value, or marginal significance level,
associated with the observed test statistic $\hat z$. The P value for $\hat z$ is defined as the
greatest level for which a test based on $\hat z$ fails to reject the null. Equivalently,
at least if the statistic $z$ has a continuous distribution, it is the smallest level
for which the test rejects. Thus, the test rejects for all levels greater than the
P value, and it fails to reject for all levels smaller than the P value. Therefore,
if the P value associated with $\hat z$ is denoted $p(\hat z)$, we must be prepared to accept
a probability $p(\hat z)$ of Type I error if we choose to reject the null.


For a two-tailed test, in the special case we have been discussing,

$$p(\hat z) = 2\bigl(1 - \Phi(|\hat z|)\bigr). \tag{4.07}$$

To see this, note that the test based on $\hat z$ rejects at level $\alpha$ if and only if
$|\hat z| > c_\alpha$. This inequality is equivalent to $\Phi(|\hat z|) > \Phi(c_\alpha)$, because $\Phi(\cdot)$ is
a strictly increasing function. Further, $\Phi(c_\alpha) = 1 - \alpha/2$, by (4.05). The
smallest value of $\alpha$ for which the inequality holds is thus obtained by solving
the equation

$$\Phi(|\hat z|) = 1 - \alpha/2,$$

and the solution is easily seen to be the right-hand side of (4.07).


One advantage of using P values is that they preserve all the information 
conveyed by a test statistic, while presenting it in a way that is directly 
interpretable. For example, the test statistics 2.02 and 5.77 would both lead 
us to reject the null at the .05 level using a two-tailed test. The second of 
these obviously provides more evidence against the null than does the first, 
but it is only after they are converted to P values that the magnitude of the 
difference becomes apparent. The P value for the first test statistic is .0434,
while the P value for the second is $7.93 \times 10^{-9}$, an extremely small number.
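The two P values just quoted can be reproduced from (4.07). The sketch below is one possible way to do so; it rewrites the formula in terms of the complementary error function `math.erfc`, a choice of ours that avoids loss of precision far out in the tails, and the function name is hypothetical.

```python
import math

def two_tailed_p_value(z_hat):
    """P value 2 * (1 - Phi(|z_hat|)) from (4.07), written with erfc."""
    # For the standard normal CDF, 2 * (1 - Phi(a)) equals erfc(a / sqrt(2)).
    # Using erfc avoids the cancellation in 1 - Phi(a) when a is large.
    return math.erfc(abs(z_hat) / math.sqrt(2.0))

for z_hat in (2.02, 5.77):
    print(z_hat, two_tailed_p_value(z_hat))
```

Computing 1 minus the CDF directly would also work for moderate statistics, but for a value like 5.77 the CDF is so close to 1 that the subtraction discards most of the significant digits.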


Computing a P value transforms $z$ from a random variable with the $N(0,1)$
distribution into a new random variable $p(z)$ with the uniform $U(0,1)$
distribution. In Exercise 4.1, readers are invited to prove this fact. It is quite
possible to think of $p(z)$ as a test statistic, of which the observed realization
is $p(\hat z)$. A test at level $\alpha$ rejects whenever $p(\hat z) < \alpha$. Note that the sign of
this inequality is the opposite of that in the condition $|\hat z| > c_\alpha$. Generally,
one rejects for large values of test statistics, but for small P values.
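A small simulation can make the uniformity of $p(z)$ under the null plausible, although it is of course no substitute for the proof requested in Exercise 4.1. In the sketch below, the seed, the number of replications, and the function name are arbitrary choices of ours.

```python
import math
import random

def two_tailed_p_value(z_hat):
    """P value 2 * (1 - Phi(|z_hat|)), computed as erfc(|z_hat| / sqrt(2))."""
    return math.erfc(abs(z_hat) / math.sqrt(2.0))

rng = random.Random(12345)        # fixed seed, chosen arbitrarily
p_values = [two_tailed_p_value(rng.gauss(0.0, 1.0)) for _ in range(20000)]

# If p(z) is U(0,1), the share of P values below alpha should be close to alpha.
for alpha in (0.01, 0.05, 0.10, 0.50):
    share = sum(p < alpha for p in p_values) / len(p_values)
    print(alpha, round(share, 3))
```

Each printed share is also the simulated rejection frequency of a test at level alpha, which is why it should be close to alpha when the null is true.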


Figure 4.2 illustrates how the test statistic $\hat z$ is related to its P value $p(\hat z)$.
Suppose that the value of the test statistic is 1.51. Then

$$\Pr(z > 1.51) = \Pr(z < -1.51) = .0655. \tag{4.08}$$

This implies, by equation (4.07), that the P value for a two-tailed test based
on $\hat z$ is .1310. The top panel of the figure illustrates (4.08) in terms of the
PDF of the standard normal distribution, and the bottom panel illustrates it
in terms of the CDF. To avoid clutter, no critical values are shown on the
figure, but it is clear that a test based on $\hat z$ will not reject at any level smaller
than .131. From the figure, it is also easy to see that the P value for a
one-tailed test of the hypothesis that $\beta \le \beta_0$ is .0655. This is just $\Pr(z > 1.51)$.
Similarly, the P value for a one-tailed test of the hypothesis that $\beta \ge \beta_0$ is
$\Pr(z < 1.51) = .9345$.

Figure 4.2 P values for a two-tailed test
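The one-tailed and two-tailed P values for the statistic 1.51 can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are ours.

```python
from statistics import NormalDist

z_hat = 1.51
upper_tail = 1.0 - NormalDist().cdf(z_hat)   # Pr(z > 1.51)
lower_tail = NormalDist().cdf(z_hat)         # Pr(z < 1.51)
two_tailed = 2.0 * upper_tail                # equation (4.07)

print(round(upper_tail, 4), round(lower_tail, 4), round(two_tailed, 4))
```

The two one-tailed P values necessarily sum to 1, so a statistic that gives strong evidence against one inequality gives none against the reverse inequality.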


In this section, we have introduced the basic ideas of hypothesis testing. How- 
ever, we had to make two very restrictive assumptions. The first is that the 
error terms are normally distributed, and the second, which is grossly unreal- 
istic, is that the variance of the error terms is known. In addition, we limited 
our attention to a single restriction on a single parameter. In Section 4.4, we 
will discuss the more general case of linear restrictions on the parameters of 
a linear regression model with unknown error variance. Before we can do so, 
however, we need to review the properties of the normal distribution and of 
several distributions that are closely related to it.

4.3 Some Common Distributions


Most test statistics in econometrics follow one of four well-known distribu- 
tions, at least approximately. These are the standard normal distribution, 
the chi-squared (or $\chi^2$) distribution, the Student’s t distribution, and the
F distribution. The most basic of these is the normal distribution, since the 
other three distributions can be derived from it. In this section, we discuss the 
standard, or central, versions of these distributions. Later, in Section 4.7, we 
will have occasion to introduce noncentral versions of all these distributions. 


The Normal Distribution 


The normal distribution, which is sometimes called the Gaussian distribu- 
tion in honor of the celebrated German mathematician and astronomer Carl 
Friedrich Gauss (1777-1855), even though he did not invent it, is certainly 
the most famous distribution in statistics. As we saw in Section 1.2, there 
is a whole family of normal distributions, all based on the standard normal 
distribution, so called because it has mean 0 and variance 1. The PDF of the 
standard normal distribution, which is usually denoted by $\phi(\cdot)$, was defined
in (1.06). No elementary closed-form expression exists for its CDF, which is
usually denoted by $\Phi(\cdot)$. Although there is no closed form, it is perfectly easy
to evaluate $\Phi$ numerically, and virtually every program for doing econometrics
and statistics can do this. Thus it is straightforward to compute the P value
for any test statistic that is distributed as standard normal. The graphs of
the functions $\phi$ and $\Phi$ were first shown in Figure 1.1 and have just reappeared
in Figure 4.2. In both tails, the PDF rapidly approaches 0. Thus, although
a standard normal r.v. can, in principle, take on any value on the real line, 
values greater than about 4 in absolute value occur extremely rarely. 


In Exercise 1.7, readers were asked to show that the full normal family can be
generated by varying exactly two parameters, the mean and the variance. A
random variable $X$ that is normally distributed with mean $\mu$ and variance $\sigma^2$
can be generated by the formula

$$X = \mu + \sigma Z, \tag{4.09}$$

where $Z$ is standard normal. The distribution of $X$, that is, the normal
distribution with mean $\mu$ and variance $\sigma^2$, is denoted $N(\mu, \sigma^2)$. Thus the
standard normal distribution is the $N(0,1)$ distribution. As readers were
asked to show in Exercise 1.8, the PDF of the $N(\mu, \sigma^2)$ distribution, evaluated
at $x$, is

$$\frac{1}{\sigma}\,\phi\Bigl(\frac{x - \mu}{\sigma}\Bigr) = \frac{1}{\sigma\sqrt{2\pi}}\exp\Bigl(-\frac{(x - \mu)^2}{2\sigma^2}\Bigr). \tag{4.10}$$

In expression (4.10), as in Section 1.2, we have distinguished between the
random variable $X$ and a value $x$ that it can take on. However, for the
following discussion, this distinction is more confusing than illuminating. For
the rest of this section, we therefore use lower-case letters to denote both
random variables and the arguments of their PDFs or CDFs, depending on
context. No confusion should result. Adopting this convention, then, we
see that, if $x$ is distributed as $N(\mu, \sigma^2)$, we can invert (4.09) and obtain
$z = (x - \mu)/\sigma$, where $z$ is standard normal. Note also that $z$ is the argument
of $\phi$ in the expression (4.10) of the PDF of $x$. In general, the PDF of a
normal variable $x$ with mean $\mu$ and variance $\sigma^2$ is $1/\sigma$ times $\phi$ evaluated at
the corresponding standard normal variable, which is $z = (x - \mu)/\sigma$.
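The rule that the $N(\mu, \sigma^2)$ density is $1/\sigma$ times $\phi$ evaluated at the standardized value is easy to verify numerically. In the sketch below, the helper names `phi` and `normal_pdf` and the numerical values are illustrative assumptions; the comparison value comes from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist

def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_pdf(x, mu, sigma):
    """PDF of N(mu, sigma^2) at x, written as (1/sigma) * phi((x - mu)/sigma)."""
    return phi((x - mu) / sigma) / sigma

mu, sigma, x = 1.0, 2.0, 2.5    # illustrative values only
print(normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
```

The two printed numbers should agree to rounding error, since both evaluate expression (4.10).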


Although the normal distribution is fully characterized by its first two
moments, the higher moments are also important. Because the distribution is
symmetric around its mean, the third central moment, which measures the
skewness of the distribution, is always zero.$^3$ This is true for all of the odd
central moments. The fourth moment of a symmetric distribution provides a
way to measure its kurtosis, which essentially means how thick the tails are.
In the case of the $N(\mu, \sigma^2)$ distribution, the fourth central moment is $3\sigma^4$; see
Exercise 4.2.


Linear Combinations of Normal Variables 


An important property of the normal distribution, used in our discussion in 
the preceding section, is that any linear combination of independent normally 
distributed random variables is itself normally distributed. To see this, it 
is enough to show it for independent standard normal variables, because, 
by (4.09), all normal variables can be generated as linear combinations of 
standard normal ones plus constants. We will tackle the proof in several 
steps, each of which is important in its own right. 


To begin with, let $z_1$ and $z_2$ be standard normal and mutually independent,
and consider $w = b_1 z_1 + b_2 z_2$. For the moment, we suppose that $b_1^2 + b_2^2 = 1$,
although we will remove this restriction shortly. If we reason conditionally
on $z_1$, then we find that

$$E(w \mid z_1) = b_1 z_1 + b_2 E(z_2 \mid z_1) = b_1 z_1 + b_2 E(z_2) = b_1 z_1.$$

The first equality follows because $b_1 z_1$ is a deterministic function of the
conditioning variable $z_1$, and so can be taken outside the conditional expectation.
The second, in which the conditional expectation of $z_2$ is replaced by its
unconditional expectation, follows because of the independence of $z_1$ and $z_2$ (see
Exercise 1.9). Finally, $E(z_2) = 0$ because $z_2$ is $N(0,1)$.


The conditional variance of $w$ is given by

$$E\bigl((w - E(w \mid z_1))^2 \mid z_1\bigr) = E\bigl((b_2 z_2)^2 \mid z_1\bigr) = E\bigl((b_2 z_2)^2\bigr) = b_2^2,$$

where the last equality again follows because $z_2 \sim N(0,1)$. Conditionally
on $z_1$, $w$ is the sum of the constant $b_1 z_1$ and $b_2$ times a standard normal
variable $z_2$, and so the conditional distribution of $w$ is normal. Given the
conditional mean and variance we have just computed, we see that the
conditional distribution must be $N(b_1 z_1, b_2^2)$. The PDF of this distribution is the
density of $w$ conditional on $z_1$, and, by (4.10), it is

$$f(w \mid z_1) = \frac{1}{b_2}\,\phi\Bigl(\frac{w - b_1 z_1}{b_2}\Bigr). \tag{4.11}$$

In accord with what we noted above, the argument of $\phi$ here is equal to $z_2$,
which is the standard normal variable corresponding to $w$ conditional on $z_1$.


3 A distribution is said to be skewed to the right if the third central moment is
positive, and to the left if the third central moment is negative.


The next step is to find the joint density of $w$ and $z_1$. By (1.15), the density
of $w$ conditional on $z_1$ is the ratio of the joint density of $w$ and $z_1$ to the
marginal density of $z_1$. This marginal density is just $\phi(z_1)$, since $z_1 \sim N(0,1)$,
and so we see that the joint density is

$$f(w, z_1) = f(z_1)\,f(w \mid z_1) = \phi(z_1)\,\frac{1}{b_2}\,\phi\Bigl(\frac{w - b_1 z_1}{b_2}\Bigr). \tag{4.12}$$


If we use (1.06) to get an explicit expression for this joint density, then we
obtain

$$\frac{1}{2\pi b_2}\exp\Bigl(-\frac{1}{2b_2^2}\bigl(b_2^2 z_1^2 + w^2 - 2b_1 z_1 w + b_1^2 z_1^2\bigr)\Bigr)
= \frac{1}{2\pi b_2}\exp\Bigl(-\frac{1}{2b_2^2}\bigl(z_1^2 - 2b_1 z_1 w + w^2\bigr)\Bigr), \tag{4.13}$$

since we assumed that $b_1^2 + b_2^2 = 1$. The right-hand side of (4.13) is symmetric
with respect to $z_1$ and $w$. Thus the joint density can also be expressed as
in (4.12), but with $z_1$ and $w$ interchanged, as follows:

$$f(w, z_1) = \frac{1}{b_2}\,\phi(w)\,\phi\Bigl(\frac{z_1 - b_1 w}{b_2}\Bigr). \tag{4.14}$$

We are now ready to compute the unconditional, or marginal, density of $w$.
To do so, we integrate the joint density (4.14) with respect to $z_1$; see (1.12).
Note that $z_1$ occurs only in the last factor on the right-hand side of (4.14).
Further, the expression $(1/b_2)\phi((z_1 - b_1 w)/b_2)$, like expression (4.11), is a
probability density, and so it integrates to 1. Thus we conclude that the
marginal density of $w$ is $f(w) = \phi(w)$, and so it follows that $w$ is standard
normal, unconditionally, as we wished to show.


It is now simple to extend this argument to the case for which $b_1^2 + b_2^2 \neq 1$.
We define $r^2 = b_1^2 + b_2^2$, and consider $w/r$. The argument above shows that
$w/r$ is standard normal, and so $w \sim N(0, r^2)$. It is equally simple to extend
the result to a linear combination of any number of mutually independent
standard normal variables. If we now let $w$ be defined as $b_1 z_1 + b_2 z_2 + b_3 z_3$,
where $z_1$, $z_2$, and $z_3$ are mutually independent standard normal variables, then
$b_1 z_1 + b_2 z_2$ is normal by the result for two variables, and it is independent of $z_3$.
Thus, by applying the result for two variables again, this time to $b_1 z_1 + b_2 z_2$
and $z_3$, we see that $w$ is normal. This reasoning can obviously be extended
by induction to a linear combination of any number of independent standard
normal variables. Finally, if we consider a linear combination of independent
normal variables with nonzero means, the mean of the resulting variable is
just the same linear combination of the means of the individual variables.
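The result that $w = b_1 z_1 + b_2 z_2$ is distributed as $N(0, r^2)$ can be illustrated, though not proved, by simulation. In the sketch below, the weights, the seed, and the number of replications are arbitrary assumptions of ours.

```python
import random
import statistics

rng = random.Random(2024)            # fixed seed, chosen arbitrarily
b1, b2 = 3.0, 4.0                    # illustrative weights, so that r^2 = 25
r = (b1**2 + b2**2) ** 0.5

# Draw w = b1*z1 + b2*z2 with z1 and z2 independent standard normals.
w = [b1 * rng.gauss(0.0, 1.0) + b2 * rng.gauss(0.0, 1.0) for _ in range(50000)]

sample_var = statistics.pvariance(w)
# If w/r is standard normal, about 5% of the draws satisfy |w/r| > 1.96.
tail_share = sum(abs(wi / r) > 1.96 for wi in w) / len(w)

print(round(sample_var, 2), round(tail_share, 3))
```

The sample variance should be close to 25 and the tail share close to .05, which is what normality of $w/r$ implies for a two-tailed region with critical value 1.96.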


The Multivariate Normal Distribution 


The results of the previous subsection can be extended to linear combina- 
tions of normal random variables that are not necessarily independent. In 
order to do so, we introduce the multivariate normal distribution. As the 
name suggests, this is a family of distributions for random vectors, with the 
scalar normal distributions being special cases of it. The pair of random 
variables $z_1$ and $w$ considered above follow the bivariate normal distribution,
another special case of the multivariate normal distribution. As we will see 
in a moment, all these distributions, like the scalar normal distribution, are 
completely characterized by their first two moments. 


In order to construct the multivariate normal distribution, we begin with a
set of $m$ mutually independent standard normal variables, $z_i$, $i = 1, \ldots, m$,
which we can assemble into a random $m$-vector $\mathbf{z}$. Then any $m$-vector $\mathbf{x}$
of linearly independent linear combinations of the components of $\mathbf{z}$ follows
a multivariate normal distribution. Such a vector $\mathbf{x}$ can always be written
as $\mathbf{A}\mathbf{z}$, for some nonsingular $m \times m$ matrix $\mathbf{A}$. As we will see in a moment,
the matrix $\mathbf{A}$ can always be chosen to be lower-triangular.


We denote the components of $\mathbf{x}$ as $x_i$, $i = 1, \ldots, m$. From what we have seen
above, it is clear that each $x_i$ is normally distributed, with (unconditional)
mean zero. Therefore, from results proved in Section 3.4, it follows that the
covariance matrix of $\mathbf{x}$ is

$$\mathrm{Var}(\mathbf{x}) = E(\mathbf{x}\mathbf{x}^\top) = \mathbf{A}\,E(\mathbf{z}\mathbf{z}^\top)\,\mathbf{A}^\top = \mathbf{A}\mathbf{I}\mathbf{A}^\top = \mathbf{A}\mathbf{A}^\top.$$

Here we have used the fact that the covariance matrix of $\mathbf{z}$ is the identity
matrix $\mathbf{I}$. This is true because the variance of each component of $\mathbf{z}$ is 1,
and, since the $z_i$ are mutually independent, all the covariances are 0; see
Exercise 1.11.


Let us denote the covariance matrix of $\mathbf{x}$ by $\boldsymbol{\Omega}$. Recall that, according to
a result mentioned in Section 3.4 in connection with Crout’s algorithm, for
any positive definite matrix $\boldsymbol{\Omega}$, we can always find a lower-triangular $\mathbf{A}$ such
that $\mathbf{A}\mathbf{A}^\top = \boldsymbol{\Omega}$. Thus the matrix $\mathbf{A}$ may always be chosen to be
lower-triangular. The distribution of $\mathbf{x}$ is multivariate normal with mean vector $\mathbf{0}$
and covariance matrix $\boldsymbol{\Omega}$. We write this as $\mathbf{x} \sim N(\mathbf{0}, \boldsymbol{\Omega})$. If we add an
$m$-vector $\boldsymbol{\mu}$ of constants to $\mathbf{x}$, the resulting vector must follow the $N(\boldsymbol{\mu}, \boldsymbol{\Omega})$
distribution.
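The construction just described can be sketched for the bivariate case, where the lower-triangular factor of a 2 x 2 covariance matrix can be written down by hand. The standard deviations, the correlation, the seed, and the variable names below are illustrative assumptions, and the simulated moments are only approximate.

```python
import math
import random

sigma1, sigma2, rho = 1.5, 1.0, -0.9   # illustrative values only
omega = [[sigma1**2, rho * sigma1 * sigma2],
         [rho * sigma1 * sigma2, sigma2**2]]

# Lower-triangular A with A A' = Omega (the 2 x 2 Cholesky factor).
a11 = math.sqrt(omega[0][0])
a21 = omega[1][0] / a11
a22 = math.sqrt(omega[1][1] - a21**2)

rng = random.Random(7)                 # fixed seed, chosen arbitrarily
draws = []
for _ in range(50000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    draws.append((a11 * z1, a21 * z1 + a22 * z2))   # x = A z

# Second moments about the known mean of zero estimate the entries of Omega.
n = len(draws)
var1 = sum(x1 * x1 for x1, _ in draws) / n
cov12 = sum(x1 * x2 for x1, x2 in draws) / n
var2 = sum(x2 * x2 for _, x2 in draws) / n
print(round(var1, 2), round(cov12, 2), round(var2, 2))
```

The printed moments should be close to the corresponding entries of the covariance matrix, namely 2.25, -1.35, and 1. Adding a vector of constants to each draw would shift the mean without changing these second moments about the mean.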


Copyright © 1999, Russell Davidson and James G. MacKinnon 


134 Hypothesis Testing in Linear Regression Models 


Figure 4.3 Contours of two bivariate normal densities (axes $x_1$ and $x_2$; left panel: $\sigma_1 = 1.5$, $\sigma_2 = 1$, $\rho = -0.9$; right panel: $\sigma_1 = 1$, $\sigma_2 = 1$, $\rho = 0.5$)


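It is clear from this argument that any linear combination of random variables
that are jointly multivariate normal is itself normally distributed. Thus, if
$\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Omega})$, any scalar $\mathbf{a}^\top\mathbf{x}$, where $\mathbf{a}$ is an $m$-vector of fixed coefficients, is
normally distributed with mean $\mathbf{a}^\top\boldsymbol{\mu}$ and variance $\mathbf{a}^\top\boldsymbol{\Omega}\mathbf{a}$.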


We saw a moment ago that $\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})$ whenever the components of the
vector $\mathbf{z}$ are independent. Another crucial property of the multivariate
normal distribution is that the converse of this result is also true: If $\mathbf{x}$ is any
multivariate normal vector with zero covariances, the components of $\mathbf{x}$ are
mutually independent. This is a very special property of the multivariate 
normal distribution, and readers are asked to prove it, for the bivariate case, 
in Exercise 4.5. In general, a zero covariance between two random variables 
does not imply that they are independent. 


It is important to note that the results of the last two paragraphs do not hold 
unless the vector x is multivariate normal, that is, constructed as a set of linear 
combinations of independent normal variables. In most cases, when we have 
to deal with linear combinations of two or more normal random variables, it is 
reasonable to assume that they are jointly distributed as multivariate normal. 
However, as Exercise 1.12 illustrates, it is possible for two or more random 
variables not to be multivariate normal even though each one individually 
follows a normal distribution. 


Figure 4.3 illustrates the bivariate normal distribution, of which the PDF is
given in Exercise 4.5 in terms of the variances $\sigma_1^2$ and $\sigma_2^2$ of the two variables,
and their correlation $\rho$. Contours of the density are plotted, on the right for
$\sigma_1 = \sigma_2 = 1.0$ and $\rho = 0.5$, on the left for $\sigma_1 = 1.5$, $\sigma_2 = 1.0$, and $\rho = -0.9$.
The contours of the bivariate normal density can be seen to be elliptical. The
ellipses slope upward when $\rho > 0$ and downward when $\rho < 0$. They do so
more steeply the larger is the ratio $\sigma_2/\sigma_1$. The closer $|\rho|$ is to 1, for given
values of $\sigma_1$ and $\sigma_2$, the more elongated are the elliptical contours.


The Chi-Squared Distribution 


Suppose, as in our discussion of the multivariate normal distribution, that
the random vector $\mathbf{z}$ is such that its components $z_1, \ldots, z_m$ are mutually
independent standard normal random variables. An easy way to express this
is to write $\mathbf{z} \sim N(\mathbf{0}, \mathbf{I})$. Then the random variable

$$y \equiv \|\mathbf{z}\|^2 = \mathbf{z}^\top\mathbf{z} = \sum_{i=1}^{m} z_i^2 \tag{4.15}$$

is said to follow the chi-squared distribution with $m$ degrees of freedom. A
compact way of writing this is: $y \sim \chi^2(m)$. From (4.15), it is clear that
$m$ must be a positive integer. In the case of a test statistic, it will turn out
to be equal to the number of restrictions being tested.


The mean and variance of the $\chi^2(m)$ distribution can easily be obtained from
the definition (4.15). The mean is

$$E(y) = \sum_{i=1}^{m} E(z_i^2) = \sum_{i=1}^{m} 1 = m. \tag{4.16}$$

Since the $z_i$ are independent, the variance of the sum of the $z_i^2$ is just the sum
of the (identical) variances:

$$\mathrm{Var}(y) = \sum_{i=1}^{m} \mathrm{Var}(z_i^2) = m\,E\bigl((z_i^2 - 1)^2\bigr)
= m\,E(z_i^4 - 2z_i^2 + 1) = m(3 - 2 + 1) = 2m. \tag{4.17}$$

The penultimate equality here uses the fact that $E(z_i^4) = 3$; see Exercise 4.2.
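The results (4.16) and (4.17) can be illustrated by simulating the definition (4.15) directly. In the sketch below, the degrees of freedom, the seed, the number of replications, and the function name are arbitrary choices of ours, and the sample moments are only approximations to the population values.

```python
import random
import statistics

rng = random.Random(42)                 # fixed seed, chosen arbitrarily
m = 5                                   # illustrative degrees of freedom

def chi_squared_draw(m):
    """Sum of m squared independent standard normals, as in (4.15)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))

y = [chi_squared_draw(m) for _ in range(40000)]

# The sample mean should be near m and the sample variance near 2m.
print(round(statistics.fmean(y), 2), round(statistics.pvariance(y), 2))
```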


Another important property of the chi-squared distribution, which follows
immediately from (4.15), is that, if $y_1 \sim \chi^2(m_1)$ and $y_2 \sim \chi^2(m_2)$, and $y_1$
and $y_2$ are independent, then $y_1 + y_2 \sim \chi^2(m_1 + m_2)$. To see this, rewrite
(4.15) as

$$y = y_1 + y_2 = \sum_{i=1}^{m_1} z_i^2 + \sum_{i=m_1+1}^{m_1+m_2} z_i^2 = \sum_{i=1}^{m_1+m_2} z_i^2,$$

from which the result follows.


Figure 4.4 shows the PDF of the $\chi^2(m)$ distribution for $m = 1$, $m = 3$,
$m = 5$, and $m = 7$. The changes in the location and height of the density
function as $m$ increases are what we should expect from the results (4.16) and
(4.17) about its mean and variance. In addition, the PDF, which is extremely
skewed to the right for $m = 1$, becomes less skewed as $m$ increases. In fact, as
we will see in Section 4.5, the $\chi^2(m)$ distribution approaches the $N(m, 2m)$
distribution as $m$ becomes large.

Figure 4.4 Various chi-squared PDFs (horizontal axis from 0 to 20)


In Section 3.4, we introduced quadratic forms. As we will see, many test 
statistics can be written as quadratic forms in normal vectors, or as functions 
of such quadratic forms. The following theorem states two results about 
quadratic forms in normal vectors that will prove to be extremely useful. 


Theorem 4.1. 


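1. If the $m$-vector $\mathbf{x}$ is distributed as $N(\mathbf{0}, \boldsymbol{\Omega})$, then the quadratic
form $\mathbf{x}^\top\boldsymbol{\Omega}^{-1}\mathbf{x}$ is distributed as $\chi^2(m)$;


2. If $\mathbf{P}$ is a projection matrix with rank $r$ and $\mathbf{z}$ is an $n$-vector
that is distributed as $N(\mathbf{0}, \mathbf{I})$, then the quadratic form $\mathbf{z}^\top\mathbf{P}\mathbf{z}$ is
distributed as $\chi^2(r)$.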


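Proof: Since the vector $\mathbf{x}$ is multivariate normal with mean vector $\mathbf{0}$, so is the
vector $\mathbf{A}^{-1}\mathbf{x}$, where, as before, $\mathbf{A}\mathbf{A}^\top = \boldsymbol{\Omega}$. Moreover, the covariance matrix
of $\mathbf{A}^{-1}\mathbf{x}$ is

$$E\bigl(\mathbf{A}^{-1}\mathbf{x}\mathbf{x}^\top(\mathbf{A}^\top)^{-1}\bigr) = \mathbf{A}^{-1}\boldsymbol{\Omega}(\mathbf{A}^\top)^{-1} = \mathbf{A}^{-1}\mathbf{A}\mathbf{A}^\top(\mathbf{A}^\top)^{-1} = \mathbf{I}_m.$$

Thus we have shown that the vector $\mathbf{z} \equiv \mathbf{A}^{-1}\mathbf{x}$ is distributed as $N(\mathbf{0}, \mathbf{I})$.

The quadratic form $\mathbf{x}^\top\boldsymbol{\Omega}^{-1}\mathbf{x}$ is equal to $\mathbf{x}^\top(\mathbf{A}^\top)^{-1}\mathbf{A}^{-1}\mathbf{x} = \mathbf{z}^\top\mathbf{z}$. As we have
just shown, this is equal to the sum of $m$ independent, squared, standard
normal random variables. From the definition of the chi-squared distribution,
we know that such a sum is distributed as $\chi^2(m)$. This proves the first part
of the theorem.

Since $\mathbf{P}$ is a projection matrix, it must project orthogonally on to some
subspace of $E^n$. Suppose, then, that $\mathbf{P}$ projects on to the span of the columns of
an $n \times r$ matrix $\mathbf{Z}$. This allows us to write

$$\mathbf{z}^\top\mathbf{P}\mathbf{z} = \mathbf{z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z})^{-1}\mathbf{Z}^\top\mathbf{z}.$$

The $r$-vector $\mathbf{x} \equiv \mathbf{Z}^\top\mathbf{z}$ evidently follows the $N(\mathbf{0}, \mathbf{Z}^\top\mathbf{Z})$ distribution.
Therefore, $\mathbf{z}^\top\mathbf{P}\mathbf{z}$ is seen to be a quadratic form in the multivariate normal $r$-vector
$\mathbf{x}$ and $(\mathbf{Z}^\top\mathbf{Z})^{-1}$, which is the inverse of its covariance matrix. That this
quadratic form is distributed as $\chi^2(r)$ follows immediately from the first
part of the theorem.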
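The key algebraic step in the first part of the proof, namely that the quadratic form in $\mathbf{x}$ equals $\mathbf{z}^\top\mathbf{z}$ when $\mathbf{x} = \mathbf{A}\mathbf{z}$, can be checked numerically in the bivariate case. The covariance parameters, the test vectors, and the function name in the sketch below are illustrative assumptions.

```python
import math

sigma1, sigma2, rho = 1.5, 1.0, -0.9    # illustrative values only
w11, w12, w22 = sigma1**2, rho * sigma1 * sigma2, sigma2**2   # entries of Omega

# Lower-triangular A with A A' = Omega.
a11 = math.sqrt(w11)
a21 = w12 / a11
a22 = math.sqrt(w22 - a21**2)

def quadratic_form(x1, x2):
    """x' Omega^{-1} x, using the explicit inverse of the 2 x 2 matrix Omega."""
    det = w11 * w22 - w12**2
    return (w22 * x1 * x1 - 2.0 * w12 * x1 * x2 + w11 * x2 * x2) / det

for z1, z2 in [(1.0, 0.0), (0.3, -1.2), (-2.0, 0.7)]:
    x1, x2 = a11 * z1, a21 * z1 + a22 * z2       # x = A z
    print(round(quadratic_form(x1, x2), 6), round(z1 * z1 + z2 * z2, 6))
```

Each printed pair should agree, whatever vector is chosen. The distributional claim then follows because the second number in each pair is a sum of squares of the components of a vector that, in the theorem, is distributed as $N(\mathbf{0}, \mathbf{I})$.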


The Student’s t Distribution 


If $z \sim N(0,1)$ and $y \sim \chi^2(m)$, and $z$ and $y$ are independent, then the random
variable

$$t \equiv \frac{z}{(y/m)^{1/2}} \tag{4.18}$$

is said to follow the Student’s t distribution with $m$ degrees of freedom. A
compact way of writing this is: $t \sim t(m)$. The Student’s t distribution looks
very much like the standard normal distribution, since both are bell-shaped
and symmetric around 0.


The moments of the t distribution depend on m, and only the first m — 1 
moments exist. Thus the $t(1)$ distribution, which is also called the Cauchy
distribution, has no moments at all, and the $t(2)$ distribution has no variance.
From (4.18), we see that, for the Cauchy distribution, the denominator of t 
is just the absolute value of a standard normal random variable. Whenever 
this denominator happens to be close to zero, the ratio is likely to be a very 
big number, even if the numerator is not particularly large. Thus the Cauchy 
distribution has very thick tails. As m increases, the chance that the denom- 
inator of (4.18) is close to zero diminishes (see Figure 4.4), and so the tails 
become thinner. 


In general, if $t$ is distributed as $t(m)$ with $m > 2$, then $\operatorname{Var}(t) = m/(m-2)$.
Thus, as $m \to \infty$, the variance tends to 1, the variance of the standard
normal distribution. In fact, the entire $t(m)$ distribution tends to the standard
normal distribution as $m \to \infty$. By (4.15), the chi-squared variable $y$ can be
expressed as $\sum_{i=1}^m z_i^2$, where the $z_i$ are independent standard normal variables.
Therefore, by a law of large numbers, such as (3.16), $y/m$, which is the average
of the $z_i^2$, tends to its expectation as $m \to \infty$. By (4.16), this expectation is
just $m/m = 1$. It follows that the denominator of (4.18), $(y/m)^{1/2}$, also tends
to 1, and hence that $t \to z \sim N(0,1)$ as $m \to \infty$.
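The variance formula and the convergence to the standard normal can be checked by simulation. The following is a minimal sketch, assuming NumPy is available; the function name `draw_student_t`, the seed, and the number of draws are illustrative choices, not part of the text. It builds $t(m)$ variables directly from the construction in (4.18) and compares the simulated variance with $m/(m-2)$.

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed so the run is reproducible

def draw_student_t(m, size, rng):
    """Draw from t(m) by the construction in (4.18): t = z / sqrt(y / m)."""
    z = rng.standard_normal(size)   # z ~ N(0, 1)
    y = rng.chisquare(m, size)      # y ~ chi-squared(m), drawn independently of z
    return z / np.sqrt(y / m)

n_draws = 400_000
sample_var = {m: draw_student_t(m, n_draws, rng).var() for m in (10, 30, 100)}
theory_var = {m: m / (m - 2) for m in sample_var}

for m in sample_var:
    print(f"m = {m:3d}: simulated variance {sample_var[m]:.3f}, "
          f"m/(m-2) = {theory_var[m]:.3f}")
```

As $m$ grows, both columns of the printed table should approach 1, in line with the argument above.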


Figure 4.5 shows the PDFs of the standard normal, $t(1)$, $t(2)$, and $t(5)$
distributions. In order to make the differences among the various densities in the
figure apparent, all the values of $m$ are chosen to be very small. However, it
is clear from the figure that, for larger values of $m$, the PDF of $t(m)$ will be
very similar to the PDF of the standard normal distribution.


Figure 4.5 PDFs of the Student’s t distribution


The F Distribution


If $y_1$ and $y_2$ are independent random variables distributed as $\chi^2(m_1)$ and
$\chi^2(m_2)$, respectively, then the random variable

$$F \equiv \frac{y_1/m_1}{y_2/m_2} \tag{4.19}$$

is said to follow the F distribution with $m_1$ and $m_2$ degrees of freedom. A
compact way of writing this is: $F \sim F(m_1, m_2)$. The notation F is used in
honor of the well-known statistician R. A. Fisher. The $F(m_1, m_2)$ distribution
looks a lot like a rescaled version of the $\chi^2(m_1)$ distribution. As for the
t distribution, the denominator of (4.19) tends to unity as $m_2 \to \infty$, and
so $m_1 F \to y_1 \sim \chi^2(m_1)$ as $m_2 \to \infty$. Therefore, for large values of $m_2$, a
random variable that is distributed as $F(m_1, m_2)$ will behave very much like
$1/m_1$ times a random variable that is distributed as $\chi^2(m_1)$.


The F distribution is very closely related to the Student’s t distribution. It is 
evident from (4.19) and (4.18) that the square of a random variable which is 
distributed as t(m2) will be distributed as F(1, m2). In the next section, we 
will see how these two distributions arise in the context of hypothesis testing 
in linear regression models. 
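The link between the two distributions can be written out in one line. As a short worked step, using only (4.18) and (4.19) together with the fact, from (4.15) with a single term, that the square of a standard normal variable is $\chi^2(1)$:

$$t^2 = \frac{z^2}{y/m_2} = \frac{z^2/1}{y/m_2} \sim F(1, m_2), \qquad z^2 \sim \chi^2(1), \quad y \sim \chi^2(m_2),$$

where $z^2$ and $y$ are independent because $z$ and $y$ are.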


Copyright © 1999, Russell Davidson and James G. MacKinnon 


4.4 Exact Tests in the Classical Normal Linear Model 


In the example of Section 4.2, we were able to obtain a test statistic z that was 
distributed as N(0,1). Tests based on this statistic are exact. Unfortunately, 
it is possible to perform exact tests only in certain special cases. One very 
important special case of this type arises when we test linear restrictions on 
the parameters of the classical normal linear model, which was introduced in 
Section 3.1. This model may be written as 


$$y = X\beta + u, \quad u \sim N(0, \sigma^2 I), \tag{4.20}$$


where $X$ is an $n \times k$ matrix of regressors, so that there are $n$ observations
and $k$ regressors, and it is assumed that the error vector $u$ is statistically
independent of the matrix $X$. Notice that in (4.20) the assumption which in
Section 3.1 was written as $u_t \sim \mathrm{NID}(0, \sigma^2)$ is now expressed in matrix notation
using the multivariate normal distribution. In addition, since the assumption
that $u$ and $X$ are independent means that the generating process for $X$ is
independent of that for $y$, we can express this independence assumption by
saying that the regressors $X$ are exogenous in the model (4.20); the concept
of exogeneity⁴ was introduced in Section 1.3 and discussed in Section 3.2.


Tests of a Single Restriction 


We begin by considering a single, linear restriction on $\beta$. This could, in
principle, be any sort of linear restriction, for example, that $\beta_1 = 5$ or $\beta_3 = \beta_4$.
However, it simplifies the analysis, and involves no loss of generality, if we
confine our attention to a restriction that one of the coefficients should equal 0.
If a restriction does not naturally have the form of a zero restriction, we can
always apply suitable linear transformations to $y$ and $X$, of the sort considered
in Sections 2.3 and 2.4, in order to rewrite the model so that it does; see
Exercises 4.6 and 4.7.


Let us partition $\beta$ as $[\beta_1 \;\vdots\; \beta_2]$, where $\beta_1$ is a $(k-1)$-vector and $\beta_2$ is a
scalar, and consider a restriction of the form $\beta_2 = 0$. When $X$ is partitioned
conformably with $\beta$, the model (4.20) can be rewritten as


$$y = X_1\beta_1 + \beta_2 x_2 + u, \quad u \sim N(0, \sigma^2 I), \tag{4.21}$$


where $X_1$ denotes an $n \times (k-1)$ matrix and $x_2$ denotes an $n$-vector, with
$X = [X_1 \;\; x_2]$.

By the FWL Theorem, the least squares estimate of $\beta_2$ from (4.21) is the
same as the least squares estimate from the FWL regression


$$M_1 y = \beta_2 M_1 x_2 + \text{residuals}, \tag{4.22}$$


4 This assumption is usually called strict exogeneity in the literature, but, since 
we will not discuss any other sort of exogeneity in this book, it is convenient 
to drop the word “strict”. 
where $M_1 = I - X_1(X_1^\top X_1)^{-1}X_1^\top$ is the matrix that projects on to $\mathcal{S}^\perp(X_1)$.
By applying the standard formulas for the OLS estimator and covariance
matrix to regression (4.22), under the assumption that the model (4.21) is
correctly specified, we find that


$$\hat\beta_2 = \frac{x_2^\top M_1 y}{x_2^\top M_1 x_2} \quad\text{and}\quad \operatorname{Var}(\hat\beta_2) = \sigma^2 (x_2^\top M_1 x_2)^{-1}.$$


In order to test the hypothesis that $\beta_2$ equals any specified value, say $\beta_2^0$, we
have to subtract $\beta_2^0$ from $\hat\beta_2$ and divide by the square root of the variance. For
the null hypothesis that $\beta_2 = 0$, this yields a test statistic analogous to (4.03),


$$z_{\beta_2} \equiv \frac{x_2^\top M_1 y}{\sigma (x_2^\top M_1 x_2)^{1/2}}, \tag{4.23}$$


which can be computed only under the unrealistic assumption that $\sigma$ is known.


If the data are actually generated by the model (4.21) with $\beta_2 = 0$, then

$$M_1 y = M_1(X_1\beta_1 + u) = M_1 u.$$

Therefore, the right-hand side of (4.23) becomes


$$\frac{x_2^\top M_1 u}{\sigma (x_2^\top M_1 x_2)^{1/2}}. \tag{4.24}$$


It is now easy to see that $z_{\beta_2}$ is distributed as $N(0,1)$. Because we can
condition on $X$, the only thing left in (4.24) that is stochastic is $u$. Since
the numerator is just a linear combination of the components of $u$, which is
multivariate normal, the entire test statistic must be normally distributed.
The variance of the numerator is


$$\begin{aligned}
E(x_2^\top M_1 u u^\top M_1 x_2) &= x_2^\top M_1 E(u u^\top) M_1 x_2 \\
&= x_2^\top M_1 (\sigma^2 I) M_1 x_2 = \sigma^2 x_2^\top M_1 x_2.
\end{aligned}$$


Since the denominator of (4.24) is just the square root of the variance of
the numerator, we conclude that $z_{\beta_2}$ is distributed as $N(0,1)$ under the null
hypothesis.


The test statistic $z_{\beta_2}$ defined in (4.23) has exactly the same distribution under
the null hypothesis as the test statistic $z$ defined in (4.03). The analysis of
Section 4.2 therefore applies to it without any change. Thus we now know 
how to test the hypothesis that any coefficient in the classical normal linear 
model is equal to 0, or to any specified value, but only if we know the variance 
of the error terms. 
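A small Monte Carlo experiment can make the $N(0,1)$ result concrete. The sketch below assumes NumPy; the sample size, the value of $\sigma$, the regressors, and all variable names are invented for illustration. It holds $X$ fixed (we condition on it), draws many error vectors under the null, and computes the statistic (4.23) for each draw.

```python
import numpy as np

rng = np.random.default_rng(2024)          # fixed seed for reproducibility

n, sigma = 50, 2.0                          # sigma is treated as known here
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # n x (k - 1)
x2 = rng.standard_normal(n)                                  # regressor under test
beta1 = np.array([1.0, -0.5])                                # beta_2 = 0: null holds

# M1 = I - X1 (X1'X1)^{-1} X1' annihilates the columns of X1.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
M1x2 = M1 @ x2
denom = sigma * np.sqrt(x2 @ M1x2)          # sigma (x2' M1 x2)^{1/2}

reps = 20_000
U = sigma * rng.standard_normal((reps, n))  # each row is one draw of u
Y = X1 @ beta1 + U                          # each row is one draw of y
z_draws = (Y @ M1x2) / denom                # statistic (4.23), one per row

z_mean, z_var = z_draws.mean(), z_draws.var()
print(f"mean of z: {z_mean:.3f}, variance of z: {z_var:.3f}")
```

The simulated mean and variance should be close to 0 and 1, respectively; the experiment is feasible only because the true $\sigma$ is supplied.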


In order to handle the more realistic case in which we do not know the variance
of the error terms, we need to replace $\sigma$ in (4.23) by $s$, the usual least squares
standard error estimator for model (4.21), defined in (3.49). If, as usual, $M_X$
is the orthogonal projection on to $\mathcal{S}^\perp(X)$, we have $s^2 = y^\top M_X y/(n-k)$, and
so we obtain the test statistic


$$t_{\beta_2} \equiv \frac{x_2^\top M_1 y}{s (x_2^\top M_1 x_2)^{1/2}} = \left(\frac{y^\top M_X y}{n-k}\right)^{-1/2} \frac{x_2^\top M_1 y}{(x_2^\top M_1 x_2)^{1/2}}. \tag{4.25}$$


As we will now demonstrate, this test statistic is distributed as $t(n-k)$ under
the null hypothesis. Not surprisingly, it is called a t statistic.


As we discussed in the last section, for a test statistic to have the $t(n-k)$
distribution, it must be possible to write it as the ratio of a standard normal
variable $z$ to the square root of $y/(n-k)$, where $y$ is independent of $z$ and
distributed as $\chi^2(n-k)$. The t statistic defined in (4.25) can be rewritten as


$$t_{\beta_2} = \frac{z_{\beta_2}}{\bigl(y^\top M_X y/((n-k)\sigma^2)\bigr)^{1/2}}, \tag{4.26}$$


which has the form of such a ratio. We have already shown that $z_{\beta_2} \sim N(0,1)$.
Thus it only remains to show that $y^\top M_X y/\sigma^2 \sim \chi^2(n-k)$ and that the
random variables in the numerator and denominator of (4.26) are independent.


Under any DGP that belongs to (4.21),


$$\frac{y^\top M_X y}{\sigma^2} = \frac{u^\top M_X u}{\sigma^2} = \varepsilon^\top M_X \varepsilon, \tag{4.27}$$


where $\varepsilon \equiv u/\sigma$ is distributed as $N(0, I)$. Since $M_X$ is a projection matrix
with rank $n - k$, the second part of Theorem 4.1 shows that the rightmost
expression in (4.27) is distributed as $\chi^2(n-k)$.

To see that the random variables $z_{\beta_2}$ and $\varepsilon^\top M_X \varepsilon$ are independent, we note
first that $\varepsilon^\top M_X \varepsilon$ depends on $y$ only through $M_X y$. Second, from (4.23), it
is not hard to see that $z_{\beta_2}$ depends on $y$ only through $P_X y$, since


$$x_2^\top M_1 y = x_2^\top P_X M_1 y = x_2^\top (P_X - P_X P_1) y = x_2^\top M_1 P_X y;$$


the first equality here simply uses the fact that $x_2 \in \mathcal{S}(X)$, and the third
equality uses the result (2.36) that $P_X P_1 = P_1 P_X$. Independence now follows
because, as we will see directly, $P_X y$ and $M_X y$ are independent.

We saw above that $M_X y = M_X u$. Further, from (4.20), $P_X y = X\beta + P_X u$,
from which it follows that the centered version of $P_X y$ is $P_X u$. The $n \times n$
matrix of covariances of components of $P_X u$ and $M_X u$ is thus


$$E(P_X u u^\top M_X) = \sigma^2 P_X M_X = O,$$


by (2.26), because $P_X$ and $M_X$ are complementary projections. These zero
covariances imply that $P_X u$ and $M_X u$ are independent, since both are
multivariate normal. Geometrically, these vectors have zero covariance because
they lie in orthogonal subspaces, namely, the images of $P_X$ and $M_X$. Thus,
even though the numerator and denominator of (4.26) both depend on $y$, this
orthogonality implies that they are independent.
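The two algebraic facts used in this argument are easy to confirm numerically. The following sketch assumes NumPy and uses arbitrary simulated data; the helper `proj` and all variable names are illustrative. It checks that $P_X M_X$ is a zero matrix and that $x_2^\top M_1 y$ is unchanged when $y$ is replaced by $P_X y$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
X1, x2 = X[:, :-1], X[:, -1]             # X = [X1  x2]
y = rng.standard_normal(n)

def proj(A):
    """Orthogonal projection on to the span of the columns of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

PX, P1 = proj(X), proj(X1)
MX, M1 = np.eye(n) - PX, np.eye(n) - P1

max_abs_PXMX = np.abs(PX @ MX).max()     # should vanish up to rounding error
lhs = x2 @ M1 @ y                        # x2' M1 y
rhs = x2 @ M1 @ PX @ y                   # x2' M1 PX y
print(max_abs_PXMX, lhs, rhs)
```

Because the numerator of the t statistic sees only $P_X y$ and the denominator sees only $M_X y$, the zero product $P_X M_X$ is what delivers independence under normality.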


We therefore conclude that the t statistic (4.26) for $\beta_2 = 0$ in the model (4.21)
has the $t(n-k)$ distribution. Performing one-tailed and two-tailed tests based
on $t_{\beta_2}$ is almost the same as performing them based on $z_{\beta_2}$. We just have to
use the $t(n-k)$ distribution instead of the $N(0,1)$ distribution to compute
P values or critical values. An interesting property of t statistics is explored
in Exercise 14.8.
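In practice the t statistic is usually computed as a coefficient estimate divided by its standard error. The sketch below, which assumes NumPy and uses made-up data and variable names, computes the statistic both from the FWL form (4.25) and from the usual OLS formulas, so that the two routes can be compared.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 40, 3
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
x2 = rng.standard_normal(n)
X = np.column_stack([X1, x2])
y = X1 @ np.array([1.0, 0.5]) + rng.standard_normal(n)   # generated with beta_2 = 0

# Route 1: the FWL form (4.25).
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
s2 = (y @ MX @ y) / (n - k)                               # s^2 = y'M_X y / (n - k)
t_fwl = (x2 @ M1 @ y) / np.sqrt(s2 * (x2 @ M1 @ x2))

# Route 2: coefficient estimate divided by its standard error.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
t_ols = beta_hat[-1] / np.sqrt(s2 * XtX_inv[-1, -1])

print(t_fwl, t_ols)
```

A P value would then be obtained from the CDF of the $t(n-k)$ distribution, which a statistics library can supply; that step is omitted here.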


Tests of Several Restrictions 


Economists frequently want to test more than one linear restriction. Let us
suppose that there are $r$ restrictions, with $r \le k$, since there cannot be more
equality restrictions than there are parameters in the unrestricted model. As
before, there will be no loss of generality if we assume that the restrictions
take the form $\beta_2 = 0$. The alternative hypothesis is the model (4.20), which
has been rewritten as


$$H_1:\; y = X_1\beta_1 + X_2\beta_2 + u, \quad u \sim N(0, \sigma^2 I). \tag{4.28}$$


Here $X_1$ is an $n \times k_1$ matrix, $X_2$ is an $n \times k_2$ matrix, $\beta_1$ is a $k_1$-vector, $\beta_2$ is
a $k_2$-vector, $k = k_1 + k_2$, and the number of restrictions $r = k_2$. Unless $r = 1$,
it is no longer possible to use a t test, because there will be one t statistic for
each element of $\beta_2$, and we want to compute a single test statistic for all the
restrictions at once.


It is natural to base a test on a comparison of how well the model fits when
the restrictions are imposed with how well it fits when they are not imposed.
The null hypothesis is the regression model


$$H_0:\; y = X_1\beta_1 + u, \quad u \sim N(0, \sigma^2 I), \tag{4.29}$$


in which we impose the restriction that $\beta_2 = 0$. As we saw in Section 3.8,
the restricted model (4.29) must always fit worse than the unrestricted model
(4.28), in the sense that the SSR from (4.29) cannot be smaller, and will
almost always be larger, than the SSR from (4.28). However, if the restrictions
are true, the reduction in SSR from adding $X_2$ to the regression should be
relatively small. Therefore, it seems natural to base a test statistic on the
difference between these two SSRs. If USSR denotes the unrestricted sum
of squared residuals, from (4.28), and RSSR denotes the restricted sum of
squared residuals, from (4.29), the appropriate test statistic is


$$F_{\beta_2} \equiv \frac{(\text{RSSR} - \text{USSR})/r}{\text{USSR}/(n-k)}. \tag{4.30}$$


Under the null hypothesis, as we will now demonstrate, this test statistic
follows the F distribution with $r$ and $n - k$ degrees of freedom. Not surprisingly,
it is called an F statistic.
The restricted SSR is $y^\top M_1 y$, and the unrestricted one is $y^\top M_X y$. One
way to obtain a convenient expression for the difference between these two
expressions is to use the FWL Theorem. By this theorem, the USSR is the
SSR from the FWL regression

$$M_1 y = M_1 X_2 \beta_2 + \text{residuals}. \tag{4.31}$$

The total sum of squares from (4.31) is $y^\top M_1 y$. The explained sum of squares
can be expressed in terms of the orthogonal projection on to the $r$-dimensional
subspace $\mathcal{S}(M_1 X_2)$, and so the difference is

$$\text{USSR} = y^\top M_1 y - y^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y. \tag{4.32}$$

Therefore,

$$\text{RSSR} - \text{USSR} = y^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y,$$


and the F statistic (4.30) can be written as


$$F_{\beta_2} = \frac{y^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y / r}{y^\top M_X y/(n-k)}. \tag{4.33}$$


Under the null hypothesis, $M_X y = M_X u$ and $M_1 y = M_1 u$. Thus, under
this hypothesis, the F statistic (4.33) reduces to


$$\frac{\varepsilon^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 \varepsilon / r}{\varepsilon^\top M_X \varepsilon/(n-k)}, \tag{4.34}$$


where, as before, $\varepsilon \equiv u/\sigma$. We saw in the last subsection that the quadratic
form in the denominator of (4.34) is distributed as $\chi^2(n-k)$. Since the
quadratic form in the numerator can be written as $\varepsilon^\top P_{M_1 X_2} \varepsilon$, it is distributed
as $\chi^2(r)$. Moreover, the random variables in the numerator and denominator
are independent, because $M_X$ and $P_{M_1 X_2}$ project on to mutually orthogonal
subspaces: $M_X M_1 X_2 = M_X (X_2 - P_1 X_2) = O$. Thus it is apparent that the
statistic (4.34) follows the $F(r, n-k)$ distribution under the null hypothesis.
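The equivalence of the two expressions for the F statistic can be verified on simulated data. This is a minimal sketch assuming NumPy; the helper `ssr`, the dimensions, and the data-generating values are illustrative assumptions. It computes the statistic once from the two sums of squared residuals, as in (4.30), and once from the quadratic form in (4.33).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k1, k2 = 60, 2, 3
k, r = k1 + k2, k2
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, k2))
X = np.column_stack([X1, X2])
y = X1 @ np.array([1.0, 2.0]) + rng.standard_normal(n)   # null is true: beta_2 = 0

def ssr(y, Z):
    """Sum of squared residuals from regressing y on Z by least squares."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return resid @ resid

rssr, ussr = ssr(y, X1), ssr(y, X)
F_ssr = ((rssr - ussr) / r) / (ussr / (n - k))                 # form (4.30)

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
M1X2 = M1 @ X2
# y'M1 X2 (X2'M1 X2)^{-1} X2'M1 y, using M1'M1 = M1
quad = y @ M1X2 @ np.linalg.solve(M1X2.T @ M1X2, M1X2.T @ y)
F_proj = (quad / r) / (ussr / (n - k))                         # form (4.33)

print(F_ssr, F_proj)
```

The SSR form is the one most regression packages make convenient, since it needs only the output of two ordinary regressions.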


A Threefold Orthogonal Decomposition 

Each of the restricted and unrestricted models generates an orthogonal
decomposition of the dependent variable $y$. It is illuminating to see how these
two decompositions interact to produce a threefold orthogonal decomposition.
It turns out that all three components of this decomposition have useful
interpretations. From the two models, we find that


$$y = P_1 y + M_1 y \quad\text{and}\quad y = P_X y + M_X y. \tag{4.35}$$


In Exercise 2.17, it was seen that $P_X - P_1$ is an orthogonal projection matrix,
equal to $P_{M_1 X_2}$. It follows that


$$P_X = P_1 + P_{M_1 X_2}, \tag{4.36}$$


where the two projections on the right-hand side are obviously mutually
orthogonal, since $P_1$ annihilates $M_1 X_2$. From (4.35) and (4.36), we obtain the
threefold orthogonal decomposition


$$y = P_1 y + P_{M_1 X_2} y + M_X y. \tag{4.37}$$


The first term is the vector of fitted values from the restricted model, $X_1\tilde\beta_1$. In
this and what follows, we use a tilde (~) to denote the restricted estimates, and
a hat (^) to denote the unrestricted estimates. The second term is the vector
of fitted values from the FWL regression (4.31). It equals $M_1 X_2 \hat\beta_2$, where,
by the FWL Theorem, $\hat\beta_2$ is a subvector of estimates from the unrestricted
model. Finally, $M_X y$ is the vector of residuals from the unrestricted model.


Since $P_X y = X_1\hat\beta_1 + X_2\hat\beta_2$, the vector of fitted values from the unrestricted
model, we see that


$$X_1\hat\beta_1 + X_2\hat\beta_2 = X_1\tilde\beta_1 + M_1 X_2\hat\beta_2. \tag{4.38}$$


In Exercise 4.9, this result is exploited to show how to obtain the restricted 
estimates in terms of the unrestricted estimates. 


The F statistic (4.33) can be written as the ratio of the squared norm of the
second component in (4.37) to the squared norm of the third, each normalized
by the appropriate number of degrees of freedom. Under both hypotheses, the
third component $M_X y$ equals $M_X u$, and so it consists of random noise. Its
squared norm is a $\chi^2(n-k)$ variable times $\sigma^2$, which, once divided by $n - k$,
serves as the (unrestricted) estimate of $\sigma^2$ and can be thought of as a measure
of the scale of the random noise. Since $u \sim N(0, \sigma^2 I)$, every element of $u$ has
the same variance, and so every component of (4.37), if centered so as to leave
only the random part, should have the same scale.


Under the null hypothesis, the second component is $P_{M_1 X_2} y = P_{M_1 X_2} u$,
which just consists of random noise. But, under the alternative, $P_{M_1 X_2} y =
M_1 X_2 \beta_2 + P_{M_1 X_2} u$, and it thus contains a systematic part related to $X_2$.
The length of the second component will be greater, on average, under the
alternative than under the null, since the random part is there in all cases, but
the systematic part is present only under the alternative. The F test compares
the squared length of the second component with the squared length of the
third. It thus serves to detect the possible presence of systematic variation,
related to $X_2$, in the second component of (4.37).
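The threefold decomposition can be reproduced numerically. The sketch below assumes NumPy; the data, the dimensions, and the helper `proj` are illustrative choices. It forms the three components of (4.37), confirms that they add up to $y$ and are mutually orthogonal, and then builds the F statistic from the squared lengths of the second and third components.

```python
import numpy as np

rng = np.random.default_rng(11)
n, k1, k2 = 50, 2, 2
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, k2))
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.standard_normal(n)

def proj(A):
    """Orthogonal projection on to the span of the columns of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

P1, PX = proj(X1), proj(X)
M1, MX = np.eye(n) - P1, np.eye(n) - PX
P_M1X2 = proj(M1 @ X2)

parts = [P1 @ y, P_M1X2 @ y, MX @ y]      # the three terms of (4.37)
recon_error = np.abs(sum(parts) - y).max()
cross = [abs(parts[i] @ parts[j]) for i in range(3) for j in range(i + 1, 3)]

k, r = k1 + k2, k2
# Squared length of the second part over that of the third, each per degree of freedom.
F_stat = (parts[1] @ parts[1] / r) / (parts[2] @ parts[2] / (n - k))
print(recon_error, max(cross), F_stat)
```

Here the data are generated with a nonzero element of $\beta_2$, so the second component contains a systematic part as well as noise.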


All this means that we want to reject the null whenever the numerator of
the F statistic, RSSR $-$ USSR, is relatively large. Consequently, the P value
corresponding to a realized F statistic $\hat F$ is computed as $1 - F_{r,n-k}(\hat F)$, where
$F_{r,n-k}(\cdot)$ denotes the CDF of the F distribution with the appropriate numbers
of degrees of freedom. Thus we compute the P value as if for a one-tailed
test. However, F tests are really two-tailed tests, because they test equality
restrictions, not inequality restrictions. An F test for $\beta_2 = 0$ will reject the
null hypothesis whenever $\hat\beta_2$ is sufficiently far from 0, whether the individual
elements of $\hat\beta_2$ are positive or negative.


There is a very close relationship between F tests and t tests. In the previous
section, we saw that the square of a random variable with the $t(n-k)$
distribution must have the $F(1, n-k)$ distribution. The square of the t statistic
$t_{\beta_2}$, defined in (4.25), is


$$t_{\beta_2}^2 = \frac{y^\top M_1 x_2 (x_2^\top M_1 x_2)^{-1} x_2^\top M_1 y}{y^\top M_X y/(n-k)}.$$


This test statistic is evidently a special case of (4.33), with the vector $x_2$
replacing the matrix $X_2$. Thus, when there is only one restriction, it makes
no difference whether we use a two-tailed t test or an F test.
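The identity between the squared t statistic and the F statistic for a single restriction holds exactly in any sample, not just in distribution. A quick numerical check, assuming NumPy and using invented data and names:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 35, 3
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
x2 = rng.standard_normal(n)
X = np.column_stack([X1, x2])
y = X @ np.array([0.3, 1.0, 0.4]) + rng.standard_normal(n)

def resid_maker(A):
    """Return M_A = I - A (A'A)^{-1} A'."""
    return np.eye(len(A)) - A @ np.linalg.solve(A.T @ A, A.T)

M1, MX = resid_maker(X1), resid_maker(X)
s2 = y @ MX @ y / (n - k)

t_stat = (x2 @ M1 @ y) / np.sqrt(s2 * (x2 @ M1 @ x2))   # t statistic (4.25)
rssr, ussr = y @ M1 @ y, y @ MX @ y
F_stat = (rssr - ussr) / (ussr / (n - k))                # F statistic (4.30), r = 1

print(t_stat ** 2, F_stat)
```

The sign of the t statistic is lost on squaring, which is why the F test corresponds to the two-tailed, not the one-tailed, t test.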


An Example of the F Test 

The most familiar application of the F test is testing the hypothesis that all
the coefficients in a classical normal linear model, except the constant term,
are zero. The null hypothesis is that $\beta_2 = 0$ in the model


$$y = \beta_1 \iota + X_2\beta_2 + u, \quad u \sim N(0, \sigma^2 I), \tag{4.39}$$


where $\iota$ is an $n$-vector of 1s and $X_2$ is $n \times (k-1)$. In this case, using (4.32),
the test statistic (4.33) can be written as


$$F_{\beta_2} = \frac{y^\top M_\iota X_2 (X_2^\top M_\iota X_2)^{-1} X_2^\top M_\iota y/(k-1)}{\bigl(y^\top M_\iota y - y^\top M_\iota X_2 (X_2^\top M_\iota X_2)^{-1} X_2^\top M_\iota y\bigr)/(n-k)}, \tag{4.40}$$


where $M_\iota$ is the projection matrix that takes deviations from the mean, which
was defined in (2.32). Thus the matrix expression in the numerator of (4.40)
is just the explained sum of squares, or ESS, from the FWL regression


$$M_\iota y = M_\iota X_2 \beta_2 + \text{residuals}.$$


Similarly, the matrix expression in the denominator is the total sum of squares,
or TSS, from this regression, minus the ESS. Since the centered $R^2$ from (4.39)
is just the ratio of this ESS to this TSS, it requires only a little algebra to
show that

$$F_{\beta_2} = \frac{n-k}{k-1} \times \frac{R^2}{1 - R^2}.$$

Therefore, the F statistic (4.40) depends on the data only through the
centered $R^2$, of which it is a monotonically increasing function.
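The "little algebra" can be confirmed with a few lines of code. The following sketch assumes NumPy; the data and names are illustrative. It computes the F statistic for the joint significance of all slope coefficients from the sums of squares, and again from the centered $R^2$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 45, 4
X2 = rng.standard_normal((n, k - 1))
X = np.column_stack([np.ones(n), X2])           # constant plus k - 1 regressors
y = X @ np.array([2.0, 0.5, 0.0, -0.3]) + rng.standard_normal(n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
ussr = resid @ resid                       # SSR from the unrestricted model (4.39)
tss = ((y - y.mean()) ** 2).sum()          # y'M_iota y, which is also the RSSR
r2 = 1.0 - ussr / tss                      # centered R^2

F_from_ssr = ((tss - ussr) / (k - 1)) / (ussr / (n - k))
F_from_r2 = (n - k) / (k - 1) * r2 / (1.0 - r2)
print(F_from_ssr, F_from_r2)
```

Because the restricted model contains only the constant, its SSR is simply the total sum of squares around the mean.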
Testing the Equality of Two Parameter Vectors


It is often natural to divide a sample into two, or possibly more than two, 
subsamples. These might correspond to periods of fixed exchange rates and 
floating exchange rates, large firms and small firms, rich countries and poor 
countries, or men and women, to name just a few examples. We may then 
ask whether a linear regression model has the same coefficients for both the 
subsamples. It is natural to use an F test for this purpose. Because the classic 
treatment of this problem is found in Chow (1960), the test is often called a 
Chow test; later treatments include Fisher (1970) and Dufour (1982). 


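Let us suppose, for simplicity, that there are only two subsamples, of lengths
$n_1$ and $n_2$, with $n = n_1 + n_2$. We will assume that both $n_1$ and $n_2$ are
greater than $k$, the number of regressors. If we separate the subsamples by
partitioning the variables, we can write


$$y \equiv \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \quad\text{and}\quad X \equiv \begin{bmatrix} X_1 \\ X_2 \end{bmatrix},$$


where $y_1$ and $y_2$ are, respectively, an $n_1$-vector and an $n_2$-vector, while $X_1$
and $X_2$ are $n_1 \times k$ and $n_2 \times k$ matrices. Even if we need different parameter
vectors, $\beta_1$ and $\beta_2$, for the two subsamples, we can nonetheless put the
subsamples together in the following regression model:


$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta_1 + \begin{bmatrix} O \\ X_2 \end{bmatrix}\gamma + u, \quad u \sim N(0, \sigma^2 I). \tag{4.41}$$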


It can readily be seen that, in the first subsample, the regression functions
are the components of $X_1\beta_1$, while, in the second, they are the components
of $X_2(\beta_1 + \gamma)$. Thus $\gamma$ is to be defined as $\beta_2 - \beta_1$. If we define $Z$ as an
$n \times k$ matrix with $O$ in its first $n_1$ rows and $X_2$ in the remaining $n_2$ rows,
then (4.41) can be rewritten as


$$y = X\beta_1 + Z\gamma + u, \quad u \sim N(0, \sigma^2 I). \tag{4.42}$$


This is a regression model with $n$ observations and $2k$ regressors. It has
been constructed in such a way that $\beta_1$ is estimated directly, while $\beta_2$ is
estimated using the relation $\beta_2 = \gamma + \beta_1$. Since the restriction that $\beta_1 = \beta_2$
is equivalent to the restriction that $\gamma = 0$ in (4.42), the null hypothesis has
been expressed as a set of $k$ zero restrictions. Since (4.42) is just a classical
normal linear model with $k$ linear restrictions to be tested, the F test provides
the appropriate way to test those restrictions.


The F statistic can perfectly well be computed as usual, by running (4.42)
to get the USSR and then running the restricted model, which is just the
regression of $y$ on $X$, to get the RSSR. However, there is another way to
compute the USSR. In Exercise 4.10, readers are invited to show that it
is simply the sum of the two SSRs obtained by running two independent
regressions on the two subsamples. If $\text{SSR}_1$ and $\text{SSR}_2$ denote the sums of
squared residuals from these two regressions, and RSSR denotes the sum of
squared residuals from regressing $y$ on $X$, the F statistic becomes


$$F_\gamma \equiv \frac{(\text{RSSR} - \text{SSR}_1 - \text{SSR}_2)/k}{(\text{SSR}_1 + \text{SSR}_2)/(n-2k)}. \tag{4.43}$$


This Chow statistic, as it is often called, is distributed as $F(k, n-2k)$ under
the null hypothesis that $\beta_1 = \beta_2$.
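Both routes to the USSR can be compared directly. The sketch below assumes NumPy; the subsample sizes, coefficients, and the helper `ssr` are illustrative. It estimates the pooled regression (4.42) with the constructed matrix $Z$, estimates the two subsample regressions separately, and then forms the Chow statistic (4.43).

```python
import numpy as np

rng = np.random.default_rng(21)
n1, n2, k = 40, 50, 3
n = n1 + n2
X1 = np.column_stack([np.ones(n1), rng.standard_normal((n1, k - 1))])
X2 = np.column_stack([np.ones(n2), rng.standard_normal((n2, k - 1))])
beta = np.array([1.0, 0.5, -0.5])                    # same coefficients: null is true
y1 = X1 @ beta + rng.standard_normal(n1)
y2 = X2 @ beta + rng.standard_normal(n2)

def ssr(y, Z):
    """Sum of squared residuals from regressing y on Z by least squares."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return resid @ resid

y, X = np.concatenate([y1, y2]), np.vstack([X1, X2])
Z = np.vstack([np.zeros((n1, k)), X2])               # O in the first n1 rows, X2 below

rssr = ssr(y, X)                                     # restricted model: gamma = 0
ussr_pooled = ssr(y, np.column_stack([X, Z]))        # unrestricted regression (4.42)
ussr_split = ssr(y1, X1) + ssr(y2, X2)               # two separate regressions

chow = ((rssr - ussr_split) / k) / (ussr_split / (n - 2 * k))   # statistic (4.43)
print(ussr_pooled, ussr_split, chow)
```

The two values of the USSR agree up to rounding error, so only three ordinary regressions are needed to compute the statistic.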


4.5 Large-Sample Tests in Linear Regression Models 


The t and F tests that we developed in the previous section are exact only 
under the strong assumptions of the classical normal linear model. If the 
error vector were not normally distributed or not independent of the matrix 
of regressors, we could still compute t and F statistics, but they would not 
actually follow their namesake distributions in finite samples. However, like 
a great many test statistics in econometrics which do not follow any known 
distribution exactly, they would in many cases approximately follow known 
distributions in large samples. In such cases, we can perform what are called 
large-sample tests or asymptotic tests, using the approximate distributions to 
compute P values or critical values. 


Asymptotic theory is concerned with the distributions of estimators and test 
statistics as the sample size n tends to infinity. It often allows us to obtain 
simple results which provide useful approximations even when the sample size 
is far from infinite. In this book, we do not intend to discuss asymptotic the- 
ory at the advanced level of Davidson (1994) or White (1984). A rigorous 
introduction to the fundamental ideas may be found in Gallant (1997), and a 
less formal treatment is provided in Davidson and MacKinnon (1993). How- 
ever, it is impossible to understand large parts of econometrics without having 
some idea of how asymptotic theory works and what we can learn from it. In 
this section, we will show that asymptotic theory gives us results about the 
distributions of t and F statistics under much weaker assumptions than those 
of the classical normal linear model. 


Laws of Large Numbers 


There are two types of fundamental results on which asymptotic theory is 
based. The first type, which we briefly discussed in Section 3.3, is called a law 
of large numbers, or LLN. A law of large numbers may apply to any quantity 
which can be written as an average of n random variables, that is, 1/n times 
their sum. Suppose, for example, that


$$\bar x \equiv \frac{1}{n} \sum_{t=1}^n x_t,$$


where the $x_t$ are independent random variables, each with its own bounded
finite variance $\sigma_t^2$ and with a common mean $\mu$. Then a fairly simple LLN
assures us that, as $n \to \infty$, $\bar x$ tends to $\mu$.


Figure 4.6 EDFs for several sample sizes


An example of how useful a law of large numbers can be is the Fundamental
Theorem of Statistics, which concerns the empirical distribution function,
or EDF, of a random sample. The EDF was introduced in Exercises 1.1
and 3.4. Suppose that $X$ is a random variable with CDF $F(X)$ and that
we obtain a random sample of size $n$ with typical element $x_t$, where each
$x_t$ is an independent realization of $X$. The empirical distribution defined by
this sample is the discrete distribution that puts a weight of $1/n$ at each of
the $x_t$, $t = 1, \ldots, n$. The EDF is the distribution function of the empirical
distribution, and it can be expressed algebraically as


$$\hat F(x) \equiv \frac{1}{n} \sum_{t=1}^n I(x_t \le x), \tag{4.44}$$


where $I(\cdot)$ is the indicator function, which takes the value 1 when its argument
is true and takes the value 0 otherwise. Thus, for a given argument $x$, the
sum on the right-hand side of (4.44) counts the number of realizations $x_t$ that
are smaller than or equal to $x$. The EDF has the form of a step function: The
height of each step is $1/n$, and the width is equal to the difference between two
successive values of $x_t$. According to the Fundamental Theorem of Statistics,
the EDF consistently estimates the CDF of the random variable $X$.
Figure 4.6 shows the EDFs for three samples of sizes 20, 100, and 500 drawn
from three normal distributions, each with variance 1 and with means 0, 2, 
and 4, respectively. These may be compared with the CDF of the standard 
normal distribution in the lower panel of Figure 4.2. There is not much 
resemblance between the EDF based on n = 20 and the normal CDF from 
which the sample was drawn, but the resemblance is somewhat stronger for 
n = 100 and very much stronger for n = 500. It is a simple matter to 
simulate data from an EDF, as we will see in the next section, and this type 
of simulation can be very useful. 


It is very easy to prove the Fundamental Theorem of Statistics. For any real
value of $x$, each term in the sum on the right-hand side of (4.44) depends only
on $x_t$. The expectation of $I(x_t \le x)$ can be found by using the fact that it
can take on only two values, 1 and 0. The expectation is


$$\begin{aligned}
E\bigl(I(x_t \le x)\bigr) &= 0 \cdot \Pr\bigl(I(x_t \le x) = 0\bigr) + 1 \cdot \Pr\bigl(I(x_t \le x) = 1\bigr) \\
&= \Pr\bigl(I(x_t \le x) = 1\bigr) = \Pr(x_t \le x) = F(x).
\end{aligned}$$


Since the $x_t$ are mutually independent, so too are the terms $I(x_t \le x)$. Since
the $x_t$ all follow the same distribution, so too must these terms. Thus (4.44) is
the mean of $n$ IID random terms, each with finite expectation. The simplest
of all LLNs (due to Khinchin) applies to such a mean, and we conclude that,
for every $x$, $\hat F(x)$ is a consistent estimator of $F(x)$.
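The consistency of the EDF is easy to see in a simulation. The following is a minimal sketch, assuming NumPy; the function names `edf` and `normal_cdf`, the grid, and the sample sizes are illustrative. It implements (4.44) directly and measures the largest gap between the EDF and the standard normal CDF over a grid of points.

```python
import math
import numpy as np

rng = np.random.default_rng(100)

def edf(sample, x):
    """Empirical distribution function (4.44): share of observations <= x."""
    sample = np.asarray(sample)
    # Compare every observation with every evaluation point, then average.
    return np.mean(sample[:, None] <= np.atleast_1d(x)[None, :], axis=0)

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

grid = np.linspace(-3.0, 3.0, 121)
true_cdf = np.array([normal_cdf(v) for v in grid])

max_gap = {}
for n in (20, 100, 500, 5000):
    sample = rng.standard_normal(n)
    max_gap[n] = np.abs(edf(sample, grid) - true_cdf).max()
    print(f"n = {n:5d}: largest gap on the grid = {max_gap[n]:.3f}")
```

The gap need not shrink monotonically from one sample size to the next in a single run, but it should be small for the largest sample.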


There are many different LLNs, some of which do not require that the
individual random variables have a common mean or be independent, although
the amount of dependence must be limited. If we can apply a LLN to any
random average, we can treat it as a nonrandom quantity for the purpose of
asymptotic analysis. In many cases, this means that we must divide the
quantity of interest by $n$. For example, the matrix $X^\top X$ that appears in the OLS
estimator generally does not converge to anything as $n \to \infty$. In contrast,
the matrix $n^{-1} X^\top X$ will, under many plausible assumptions about how $X$ is
generated, tend to a nonstochastic limiting matrix $S_{X^\top X}$ as $n \to \infty$.


Central Limit Theorems 


The second type of fundamental result on which asymptotic theory is based 
is called a central limit theorem, or CLT. Central limit theorems are crucial 
in establishing the asymptotic distributions of estimators and test statistics. 
They tell us that, in many circumstances, $1/\sqrt{n}$ times the sum of $n$ centered
random variables will approximately follow a normal distribution when $n$ is
sufficiently large.


Suppose that the random variables $x_t$, $t = 1, \ldots, n$, are independently and
identically distributed with mean $\mu$ and variance $\sigma^2$. Then, according to the
Lindeberg-Lévy central limit theorem, the quantity


$$z \equiv \frac{1}{\sqrt{n}} \sum_{t=1}^n \frac{x_t - \mu}{\sigma} \tag{4.45}$$


is asymptotically distributed as $N(0,1)$. This means that, as $n \to \infty$, the
random variable $z$ tends to a random variable which follows the $N(0,1)$
distribution. It may seem curious that we divide by $\sqrt{n}$ instead of by $n$ in (4.45),
but this is an essential feature of every CLT. To see why, we calculate the
variance of $z$. Since the terms in the sum in (4.45) are independent, the variance
of $z$ is just the sum of the variances of the $n$ terms:


$$\operatorname{Var}(z) = n \operatorname{Var}\Bigl(\frac{1}{\sqrt{n}} \, \frac{x_t - \mu}{\sigma}\Bigr) = \frac{n}{n} = 1.$$


If we had divided by $n$, we would, by a law of large numbers, have obtained a
random variable with a plim of 0 instead of a random variable with a limiting
standard normal distribution. Thus, whenever we want to use a CLT, we
must ensure that a factor of $n^{-1/2} = 1/\sqrt{n}$ is present.


Just as there are many different LLNs, so too are there many different CLTs,
almost all of which impose weaker conditions on the $x_t$ than those imposed
by the Lindeberg-Lévy CLT. The assumption that the $x_t$ are identically
distributed is easily relaxed, as is the assumption that they are independent.
However, if there is either too much dependence or too much heterogeneity,
a CLT may not apply. Several CLTs are discussed in Section 4.7 of Davidson
and MacKinnon (1993), and Davidson (1994) provides a more advanced
treatment. In all cases of interest to us, the CLT says that, for a sequence of
random variables $x_t$, $t = 1, \ldots, \infty$, with $E(x_t) = 0$,


$$\operatorname*{plim}_{n \to \infty} n^{-1/2} \sum_{t=1}^n x_t = x_0 \sim N\Bigl(0, \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \operatorname{Var}(x_t)\Bigr).$$


We sometimes need vector, or multivariate, versions of CLTs. Suppose that we
have a sequence of random $m$-vectors $x_t$, for some fixed $m$, with $E(x_t) = 0$.
Then the appropriate multivariate version of a CLT tells us that


$$\operatorname*{plim}_{n \to \infty} n^{-1/2} \sum_{t=1}^n x_t = x_0 \sim N\Bigl(0, \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \operatorname{Var}(x_t)\Bigr), \tag{4.46}$$


where $x_0$ is multivariate normal, and each $\operatorname{Var}(x_t)$ is an $m \times m$ matrix.


Figure 4.7 illustrates the fact that CLTs often provide good approximations
even when $n$ is not very large. Both panels of the figure show the densities
of various random variables $z$ defined as in (4.45). In the top panel, the $x_t$
are uniformly distributed, and we see that $z$ is remarkably close to being
distributed as standard normal even when $n$ is as small as 8. This panel does
not show results for larger values of $n$ because they would have made it too
hard to read. In the bottom panel, the $x_t$ follow the $\chi^2(1)$ distribution, which
exhibits extreme right skewness. The mode of the distribution is 0, there are
no values less than 0, and there is a very long right-hand tail. For $n = 4$
and $n = 8$, the standard normal provides a poor approximation to the actual
distribution of $z$. For $n = 100$, on the other hand, the approximation is not
bad at all, although it is still noticeably skewed to the right.


Figure 4.7 The normal approximation for different values of n
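The behavior described for the $\chi^2(1)$ case can be reproduced by simulation. The sketch below assumes NumPy; the number of replications, the seed, and the helper names are illustrative, and sample skewness is used here as a simple numerical stand-in for the visual comparison of densities. It uses the facts that a $\chi^2(1)$ variable has mean 1 and variance 2.

```python
import numpy as np

rng = np.random.default_rng(2718)
reps = 20_000
mu, sigma = 1.0, np.sqrt(2.0)      # mean and standard deviation of chi-squared(1)

def clt_draws(n):
    """Simulate z from (4.45) with chi-squared(1) data, one z per replication."""
    x = rng.chisquare(1, size=(reps, n))
    return (x - mu).sum(axis=1) / (sigma * np.sqrt(n))

def skewness(z):
    """Sample skewness: third central moment over the 3/2 power of the second."""
    d = z - z.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

results = {}
for n in (4, 8, 100):
    z = clt_draws(n)
    results[n] = (z.mean(), z.var(), skewness(z))
    print(f"n = {n:3d}: mean {results[n][0]: .3f}, "
          f"variance {results[n][1]:.3f}, skewness {results[n][2]:.3f}")
```

The mean and variance of $z$ are close to 0 and 1 for every $n$, as the variance calculation requires, while the skewness shrinks toward 0 only gradually as $n$ increases.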


Asymptotic Tests 


The t and F tests that we discussed in the previous section are asymptotically 
valid under much weaker conditions than those needed to prove that they 
actually have their namesake distributions in finite samples. Suppose that 
the DGP is 

$$y = X\beta_0 + u, \quad u \sim \mathrm{IID}(0, \sigma_0^2 I), \tag{4.47}$$


where $\beta_0$ satisfies whatever hypothesis is being tested, and the error terms
are drawn from some specific but unknown distribution with mean 0 and
variance $\sigma_0^2$. We allow $X_t$ to contain lagged dependent variables, and so we
abandon the assumption of exogenous regressors and replace it with
assumption (3.10) from Section 3.2, plus an analogous assumption about the variance.
These two assumptions can be written as


$$E(u_t \mid X_t) = 0 \quad\text{and}\quad E(u_t^2 \mid X_t) = \sigma_0^2. \tag{4.48}$$


The first of these assumptions, which is assumption (3.10), can be referred 
to in two ways. From the point of view of the error terms, it says that they 
are innovations. An innovation is a random variable of which the mean is 0 
conditional on the information in the explanatory variables, and so knowledge
of the values taken by the latter is of no use in predicting the mean of the
innovation. From the point of view of the explanatory variables $X_t$, assumption
(3.10) says that they are predetermined with respect to the error terms. We
thus have two different ways of saying the same thing. Both can be useful, 
depending on the circumstances. 


Although we have greatly weakened the assumptions of the classical normal 
linear model, we now need to make an additional assumption in order to be 
able to use asymptotic results. We therefore assume that the data-generating 
process for the explanatory variables is such that 


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} X^\top X = S_{X^\top X}, \qquad (4.49)$$


where $S_{X^\top X}$ is a finite, deterministic, positive definite matrix. We made this 
assumption previously, in Section 3.3, when we proved that the OLS estimator 
is consistent. Although it is often reasonable, condition (4.49) is violated in 
many cases. For example, it cannot hold if one of the columns of the X matrix 
is a linear time trend, because $\sum_{t=1}^n t^2$ grows at a rate faster than n. 
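
The failure of (4.49) for a time trend can be checked directly. The following short calculation is a sketch that uses only the standard formula for a sum of squares; it looks at the diagonal element of $n^{-1}X^\top X$ that corresponds to a regressor equal to t:

```latex
\frac{1}{n}\sum_{t=1}^{n} t^2
  = \frac{1}{n}\cdot\frac{n(n+1)(2n+1)}{6}
  = \frac{(n+1)(2n+1)}{6}
  \longrightarrow \infty \quad \text{as } n \to \infty .
```

Since this element grows roughly like $n^2/3$, the matrix $n^{-1}X^\top X$ cannot tend to a finite limit.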


Now consider the t statistic (4.25) for testing the hypothesis that $\beta_2 = 0$ in 
the model (4.21). The key to proving that (4.25), or any test statistic, has 
a certain asymptotic distribution is to write it as a function of quantities to 
which we can apply either a LLN or a CLT. Therefore, we rewrite (4.25) as 


$$t_{\beta_2} = \left(\frac{y^\top M_X y}{n-k}\right)^{-1/2} \frac{n^{-1/2}\, x_2^\top M_1 y}{(n^{-1} x_2^\top M_1 x_2)^{1/2}}, \qquad (4.50)$$


where the numerator and denominator of the second factor have both been 
multiplied by $n^{-1/2}$. Under the DGP (4.47), $s^2 \equiv y^\top M_X y/(n-k)$ tends to $\sigma_0^2$ 
as $n \to \infty$. This statement, which is equivalent to saying that the OLS error 
variance estimator $s^2$ is consistent under our weaker assumptions, follows from 
a LLN, because $s^2$ has the form of an average, and the calculations leading 
to (3.49) showed that the mean of $s^2$ is $\sigma_0^2$. It follows from the consistency 
of $s^2$ that the first factor in (4.50) tends to $1/\sigma_0$ as $n \to \infty$. When the data 
are generated by (4.47) with $\beta_2 = 0$, we have that $M_1 y = M_1 u$, and so (4.50) 
is asymptotically equivalent to 


$$\frac{n^{-1/2}\, x_2^\top M_1 u}{\sigma_0\, (n^{-1} x_2^\top M_1 x_2)^{1/2}}. \qquad (4.51)$$


It is now easy to derive the asymptotic distribution of $t_{\beta_2}$ if for a moment we 
reinstate the assumption that the regressors are exogenous. In that case, we 
can work conditionally on X, which means that the only part of (4.51) that 
is treated as random is u. The numerator of (4.51) is $n^{-1/2}$ times a weighted 
sum of the $u_t$, each of which has mean 0, and the conditional variance of this 
weighted sum is 


$$E(x_2^\top M_1 u u^\top M_1 x_2 \mid X) = \sigma_0^2\, x_2^\top M_1 x_2.$$


Thus (4.51) evidently has mean 0 and variance 1, conditional on X. But 
since 0 and 1 do not depend on X, these are also the unconditional mean 
and variance of (4.51). Provided that we can apply a CLT to the numerator 
of (4.51), the numerator of $t_{\beta_2}$ must be asymptotically normally distributed, 
and we conclude that, under the null hypothesis, with exogenous regressors, 


$$t_{\beta_2} \stackrel{a}{\sim} N(0,1). \qquad (4.52)$$


The notation $\stackrel{a}{\sim}$ means that $t_{\beta_2}$ is asymptotically distributed as N(0, 1). 
Since the DGP is assumed to be (4.47), this result does not require that the 
error terms be normally distributed. 
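
The claim that normality is not needed can be illustrated by simulation. The sketch below is not from the text: it assumes a simple regression with a constant and one fixed regressor, recentered exponential (strongly skewed) errors, a sample size of 100, and the two-tailed 5% critical value 1.96 from N(0,1). All names and parameter values are illustrative assumptions.

```python
import math
import random

def t_stat_slope(x, y):
    """OLS t statistic for beta_2 = 0 in y_t = beta_1 + beta_2 * x_t + u_t."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    ssr = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (n - 2)                      # OLS estimate of the error variance
    return b2 / math.sqrt(s2 / sxx)

def rejection_rate(n=100, reps=2000, crit=1.96, seed=12345):
    """Fraction of replications in which |t| exceeds the N(0,1) critical value."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]   # regressor held fixed across replications
    rejections = 0
    for _ in range(reps):
        # Skewed errors: exponential with mean 1, recentered to have mean 0.
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [1.0 + ui for ui in u]                   # the null beta_2 = 0 is true
        if abs(t_stat_slope(x, y)) > crit:
            rejections += 1
    return rejections / reps

rate = rejection_rate()
print(f"rejection rate at nominal 5% level: {rate:.3f}")
```

If the asymptotic approximation is working, the printed rate should be near 0.05, up to simulation error.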


The t Test with Predetermined Regressors 


If we relax the assumption of exogenous regressors, the analysis becomes more 
complicated. Readers not interested in the algebraic details may well wish to 
skip to the next section, since what follows is not essential for understanding the 
rest of this chapter. However, this subsection provides an excellent example 
of how asymptotic theory works, and it illustrates clearly just why we can 
relax some assumptions but not others. 


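We begin by applying a CLT to the k-vector 


$$v \equiv n^{-1/2} X^\top u = n^{-1/2} \sum_{t=1}^{n} u_t X_t^\top. \qquad (4.53)$$


By (3.10), $E(u_t \mid X_t) = 0$. This implies that $E(u_t X_t^\top) = 0$, as required for 
the CLT, which then tells us that 


$$v \stackrel{a}{\sim} N\Bigl(0,\; \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \operatorname{Var}(u_t X_t^\top)\Bigr) = N\Bigl(0,\; \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} E(u_t^2 X_t^\top X_t)\Bigr);$$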


Copyright © 1999, Russell Davidson and James G. MacKinnon 


154 Hypothesis Testing in Linear Regression Models 


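recall (4.46). Notice that, because $X_t$ is a 1 × k row vector, the covariance 
matrix here is k × k, as it must be. The second assumption in (4.48) allows 
us to simplify the limiting covariance matrix: 


$$\begin{aligned}
\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} E(u_t^2 X_t^\top X_t) &= \lim_{n\to\infty} \sigma_0^2\, \frac{1}{n} \sum_{t=1}^{n} E(X_t^\top X_t) \\
&= \sigma_0^2 \operatorname*{plim}_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} X_t^\top X_t \qquad (4.54) \\
&= \sigma_0^2 \operatorname*{plim}_{n\to\infty} \frac{1}{n} X^\top X = \sigma_0^2 S_{X^\top X}.
\end{aligned}$$

We applied a LLN in reverse to go from the first line to the second, and the 
last equality follows from (4.49). 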


Now consider the numerator of (4.51). It can be written as 

$$n^{-1/2} x_2^\top u - n^{-1/2} x_2^\top P_1 u. \qquad (4.55)$$


The first term of this expression is just the last, or $k^{\text{th}}$, component of v, which 
we can denote by $v_2$. By writing out the projection matrix $P_1$ explicitly, and 
dividing various expressions by n in a way that cancels out, the second term 
can be rewritten as 


$$n^{-1} x_2^\top X_1 (n^{-1} X_1^\top X_1)^{-1}\, n^{-1/2} X_1^\top u. \qquad (4.56)$$


By assumption (4.49), the first and second factors of (4.56) tend to 
deterministic limits. In obvious notation, the first tends to $S_{21}$, which is a submatrix 
of $S_{X^\top X}$, and the second tends to $S_{11}^{-1}$, which is the inverse of a submatrix 
of $S_{X^\top X}$. Thus only the last factor remains random when $n \to \infty$. It is just 
the subvector of v consisting of the first k − 1 components, which we denote 
by $v_1$. Asymptotically, in partitioned matrix notation, (4.55) becomes 


$$v_2 - S_{21} S_{11}^{-1} v_1 = \begin{bmatrix} -S_{21} S_{11}^{-1} & 1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}.$$


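Since v is asymptotically multivariate normal, this scalar expression is 
asymptotically normal, with mean zero and variance 


$$\sigma_0^2 \begin{bmatrix} -S_{21} S_{11}^{-1} & 1 \end{bmatrix} S_{X^\top X} \begin{bmatrix} -S_{11}^{-1} S_{12} \\ 1 \end{bmatrix},$$


where, since $S_{X^\top X}$ is symmetric, $S_{12}$ is just the transpose of $S_{21}$. If we now 
express $S_{X^\top X}$ as a partitioned matrix, the variance of (4.55) is seen to be 


$$\sigma_0^2 \begin{bmatrix} -S_{21} S_{11}^{-1} & 1 \end{bmatrix} \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix} \begin{bmatrix} -S_{11}^{-1} S_{12} \\ 1 \end{bmatrix} = \sigma_0^2 (S_{22} - S_{21} S_{11}^{-1} S_{12}). \qquad (4.57)$$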
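
The partitioned product in this variance expression can be multiplied out in two steps. The following is a sketch of the intermediate algebra, with the scalar factor $\sigma_0^2$ omitted:

```latex
\begin{aligned}
\begin{bmatrix} -S_{21}S_{11}^{-1} & 1 \end{bmatrix}
\begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}
&= \begin{bmatrix} -S_{21}S_{11}^{-1}S_{11} + S_{21} & \; -S_{21}S_{11}^{-1}S_{12} + S_{22} \end{bmatrix} \\
&= \begin{bmatrix} 0 & \; S_{22} - S_{21}S_{11}^{-1}S_{12} \end{bmatrix}, \\[4pt]
\begin{bmatrix} 0 & \; S_{22} - S_{21}S_{11}^{-1}S_{12} \end{bmatrix}
\begin{bmatrix} -S_{11}^{-1}S_{12} \\ 1 \end{bmatrix}
&= S_{22} - S_{21}S_{11}^{-1}S_{12}.
\end{aligned}
```

The first block of the row vector vanishes, so only the lower element of the column vector contributes to the final product.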

The denominator of (4.51) is, thankfully, easier to analyze. The square of the 
second factor is 


$$\begin{aligned}
n^{-1} x_2^\top M_1 x_2 &= n^{-1} x_2^\top x_2 - n^{-1} x_2^\top P_1 x_2 \\
&= n^{-1} x_2^\top x_2 - n^{-1} x_2^\top X_1 (n^{-1} X_1^\top X_1)^{-1}\, n^{-1} X_1^\top x_2.
\end{aligned}$$

In the limit, all the pieces of this expression become submatrices of $S_{X^\top X}$, 
and so we find that 


$$\operatorname*{plim}_{n\to\infty} n^{-1} x_2^\top M_1 x_2 = S_{22} - S_{21} S_{11}^{-1} S_{12}.$$


When it is multiplied by $\sigma_0^2$, this is just (4.57), the variance of the numerator 
of (4.51). Thus, asymptotically, we have shown that $t_{\beta_2}$ is the ratio of a normal 
random variable with mean zero to its standard deviation. Consequently, we 
have established that, under the null hypothesis, with regressors that are not 
necessarily exogenous but merely predetermined, $t_{\beta_2} \stackrel{a}{\sim} N(0,1)$. This result is 
what we previously obtained as (4.52) when we assumed that the regressors 
were exogenous. 


Asymptotic F Tests 


A similar analysis can be performed for the F statistic (4.33) for the null 
hypothesis that $\beta_2 = 0$ in the model (4.28). Under the null, $F_{\beta_2}$ is equal to 
expression (4.34), which can be rewritten as 


$$\frac{n^{-1/2} \varepsilon^\top M_1 X_2 \,(n^{-1} X_2^\top M_1 X_2)^{-1}\, n^{-1/2} X_2^\top M_1 \varepsilon / r}{\varepsilon^\top M_X \varepsilon/(n-k)}, \qquad (4.58)$$


where $\varepsilon \equiv u/\sigma_0$. It is not hard to use the results we obtained for the t statistic 
to show that, as $n \to \infty$, 

$$r F_{\beta_2} \stackrel{a}{\sim} \chi^2(r) \qquad (4.59)$$


under the null hypothesis; see Exercise 4.12. Since 1/r times a random 
variable that follows the $\chi^2(r)$ distribution is distributed as $F(r, \infty)$, we can also 
conclude that $F_{\beta_2} \stackrel{a}{\sim} F(r, n-k)$. 


The results (4.52) and (4.59) justify the use of t and F tests outside the 
confines of the classical normal linear model. We can compute P values using 
either the standard normal or t distributions in the case of t statistics, and 
either the $\chi^2$ or F distributions in the case of F statistics. Of course, if we 
use the $\chi^2$ distribution, we have to multiply the F statistic by r. 
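
As a rough illustration of computing asymptotic P values, the sketch below uses only the Python standard library. It assumes a two-tailed t test, and it handles the $\chi^2$ case only for a single restriction (r = 1), where a $\chi^2(1)$ variable is the square of a standard normal one. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def p_value_t_asymptotic(t_stat):
    """Two-tailed P value for a t statistic, using the N(0,1) approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

def p_value_chi2_one_restriction(f_stat):
    """P value of r*F against chi^2(r) in the special case r = 1.

    Assumption: r = 1, so chi^2(1) is the distribution of a squared N(0,1) variable.
    """
    r = 1
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(r * f_stat)))

t = 2.1
p_from_t = p_value_t_asymptotic(t)
p_from_f = p_value_chi2_one_restriction(t ** 2)   # with r = 1, F is the square of t
print(p_from_t, p_from_f)
```

With one restriction the two P values coincide, which is one way to see that the t and F forms of the asymptotic test are equivalent in that case.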


Whatever distribution we use, these P values will be approximate, and tests 
based on them will not be exact in finite samples. In addition, our theoretical 
results do not tell us just how accurate they will be. If we decide to use a 
nominal level of $\alpha$ for a test, we will reject if the approximate P value is 
less than $\alpha$. In many cases, but certainly not all, such tests will probably be 
quite accurate, committing Type I errors with probability reasonably close 
to $\alpha$. They may either overreject, that is, reject the null hypothesis more 
than $100\alpha\%$ of the time when it is true, or underreject, that is, reject the 
null hypothesis less than $100\alpha\%$ of the time. Whether they will overreject 
or underreject, and how severely, will depend on many things, including the 
sample size, the distribution of the error terms, the number of regressors 
and their properties, and the relationship between the error terms and the 
regressors. 


4.6 Simulation-Based Tests 


When we introduced the concept of a test statistic in Section 4.2, we specified 
that it should have a known distribution under the null hypothesis. In the 
previous section, we relaxed this requirement and developed large-sample test 
statistics for which the distribution is known only approximately. In all the 
cases we have studied, the distribution of the statistic under the null hypo- 
thesis was not only (approximately) known, but also the same for all DGPs 
contained in the null hypothesis. This is a very important property, and it is 
useful to introduce some terminology that will allow us to formalize it. 


We begin with a simple remark. A hypothesis, null or alternative, can always 
be represented by a model, that is, a set of DGPs. For instance, the null and 
alternative hypotheses (4.29) and (4.28) associated with an F test of several 
restrictions are both classical normal linear models. The most fundamental 
sort of null hypothesis that we can test is a simple hypothesis. Such a hypo- 
thesis is represented by a model that contains one and only one DGP. Simple 
hypotheses are very rare in econometrics. The usual case is that of a com- 
pound hypothesis, which is represented by a model that contains more than 
one DGP. This can cause serious problems. Except in certain special cases, 
such as the exact tests in the classical normal linear model that we investi- 
gated in Section 4.4, a test statistic will have different distributions under the 
different DGPs contained in the model. In such a case, if we do not know 
just which DGP in the model generated our data, then we cannot know the 
distribution of the test statistic. 


If a test statistic is to have a known distribution under some given null hy- 
pothesis, then it must have the same distribution for each and every DGP 
contained in that null hypothesis. A random variable with the property that 
its distribution is the same for all DGPs in a model M is said to be pivotal, 
or to be a pivot, for the model M. The distribution is allowed to depend on 
the sample size, and perhaps on the observed values of exogenous variables. 
However, for any given sample size and set of exogenous variables, it must be 
invariant across all DGPs in M. Note that all test statistics are pivotal for a 
simple null hypothesis. 


The large sample tests considered in the last section allow for null hypotheses 
that do not respect the rigid constraints of the classical normal linear model. 
The price they pay for this added generality is that t and F statistics now 
have distributions that depend on things like the error distribution: They are 
therefore not pivotal statistics. However, their asymptotic distributions are 
independent of such things, and are thus invariant across all the DGPs of 
the model that represents the null hypothesis. Such statistics are said to be 
asymptotically pivotal, or asymptotic pivots, for that model. 


Simulated P Values 


The distributions of the test statistics studied in Section 4.3 are all thoroughly 
known, and their CDFs can easily be evaluated by computer programs. The 
computation of P values is therefore straightforward. Even if it were not, 
we could always estimate them by simulation. For any pivotal test statistic, 
the P value can be estimated by simulation to any desired level of accuracy. 
Since a pivotal statistic has the same distribution for all DGPs in the model 
under test, we can arbitrarily choose any such DGP for generating simulated 
samples and simulated test statistics. 


The theoretical justification for using simulation to estimate P values is the 
Fundamental Theorem of Statistics, which we discussed in Section 4.5. It 
tells us that the empirical distribution of a set of independent drawings of a 
random variable generated by some DGP converges to the true CDF of the 
random variable under that DGP. This is just as true of simulated drawings 
generated by the computer as for random variables generated by a natural 
random mechanism. Thus, if we knew that a certain test statistic was pivotal 
but did not know how it was distributed, we could select any DGP in the 
null model and generate simulated samples from it. For each of these, we 
could then compute the test statistic. If the simulated samples are mutually 
independent, the set of simulated test statistics thus generated constitutes a 
set of independent drawings from the distribution of the test statistic, and 
their EDF is a consistent estimate of the CDF of that distribution. 


Suppose that we have computed a test statistic $\hat\tau$, which could be a t statistic, 
an F statistic, or some other type of test statistic, using some data set with n 
observations. We can think of $\hat\tau$ as being a realization of a random variable $\tau$. 
We wish to test a null hypothesis represented by a model M for which $\tau$ is 
pivotal, and we want to reject the null whenever $\hat\tau$ is sufficiently large, as in the 
cases of an F statistic, a t statistic when the rejection region is in the upper 
tail, or a squared t statistic. If we denote by F the CDF of the distribution 
of $\tau$ under the null hypothesis, the P value for a test based on $\hat\tau$ is 


$$p(\hat\tau) \equiv 1 - F(\hat\tau). \qquad (4.60)$$


Since $\hat\tau$ is computed directly from our original data, this P value can be 
estimated if we can estimate the CDF F evaluated at $\hat\tau$. 


The procedure we are about to describe is very general in its application, and 
so we describe it in detail. In order to estimate a P value by simulation, 
we choose any DGP in M, and draw B samples of size n from it. How 
to choose B will be discussed shortly; it will typically be rather large, and 
B = 999 may often be a reasonable choice. We denote the simulated samples 
as $y_j^*$, j = 1,...,B. The star (*) notation will be used systematically to 
denote quantities generated by simulation. B is used to denote the number of 
simulations in order to emphasize the connection with the bootstrap, which 
we will discuss below. 


Using the simulated sample, for each j we compute a simulated test statistic, 
say $\tau_j^*$, in exactly the same way that $\hat\tau$ was computed from the original data y. 
We can then construct the EDF of the $\tau_j^*$ analogously to (4.44): 


$$\hat p^*(\hat\tau) = 1 - \hat F^*(\hat\tau) = 1 - \frac{1}{B} \sum_{j=1}^{B} I(\tau_j^* \le \hat\tau) = \frac{1}{B} \sum_{j=1}^{B} I(\tau_j^* > \hat\tau). \qquad (4.61)$$


The third equality in (4.61) can be understood by noting that the rightmost 
expression is the proportion of simulations for which $\tau_j^*$ is greater than $\hat\tau$, while 
the second expression from the right is 1 minus the proportion for which $\tau_j^*$ 
is less than or equal to $\hat\tau$. These proportions are obviously the same. 


We can see that $\hat p^*(\hat\tau)$ must lie between 0 and 1, as any P value must. For 
example, if B = 999, and 36 of the $\tau_j^*$ were greater than $\hat\tau$, we would have 
$\hat p^*(\hat\tau) = 36/999 = 0.036$. In this case, since $\hat p^*(\hat\tau)$ is less than 0.05, we would 
reject the null hypothesis at the .05 level. Since the EDF converges to the true 
CDF, it follows that, if B were infinitely large, this procedure would yield an 
exact test, and the outcome of the test would be the same as if we computed 
the P value analytically using the CDF of $\tau$. In fact, as we will see shortly, 
this procedure will yield an exact test even for certain finite values of B. 
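
The simulated P value in (4.61) is simple to compute. The sketch below first reproduces the 36-out-of-999 example, then runs a small simulated test. The second part rests on assumptions not in the text: the data are hypothetical, the null model is taken to be independent normal observations with mean zero and unknown variance, and the statistic is the squared t statistic for a zero mean, which is pivotal for that model because it does not change when the data are rescaled.

```python
import random

def simulated_p_value(tau_hat, tau_stars):
    """Proportion of simulated statistics strictly greater than tau_hat, as in (4.61)."""
    return sum(1 for ts in tau_stars if ts > tau_hat) / len(tau_stars)

def squared_t_for_zero_mean(y):
    """Squared t statistic for the hypothesis that the mean of y is zero."""
    n = len(y)
    y_bar = sum(y) / n
    s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    return n * y_bar ** 2 / s2

# Worked example from the text: B = 999 and 36 simulated statistics exceed tau_hat.
example_stats = [2.0] * 36 + [0.5] * 963
p_example = simulated_p_value(1.0, example_stats)
print(round(p_example, 3))   # 0.036

# Simulated test on hypothetical data, drawing from one DGP in the null model.
rng = random.Random(42)
y_obs = [0.8, -0.3, 1.9, 0.4, 1.1, -0.6, 0.7, 1.5]
tau_hat = squared_t_for_zero_mean(y_obs)
B = 999
tau_stars = [
    squared_t_for_zero_mean([rng.gauss(0.0, 1.0) for _ in y_obs])
    for _ in range(B)
]
p_mc = simulated_p_value(tau_hat, tau_stars)
print(p_mc)
```

Because the statistic is pivotal under the assumed model, drawing the simulated samples from N(0,1) rather than from a normal distribution with some other variance makes no difference to the distribution of the simulated statistics.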


The sort of test we have just described, based on simulating a pivotal sta- 
tistic, is called a Monte Carlo test. Simulation experiments in general are 
often referred to as Monte Carlo experiments, because they involve generat- 
ing random numbers, as do the games played in casinos. Around the time that 
computer simulations first became possible, the most famous casino was the 
one in Monte Carlo. If computers had been developed just a little later, we 
would probably be talking now of Las Vegas tests and Las Vegas experiments. 


Random Number Generators 


Drawing a simulated sample of size n requires us to generate at least n random, 
or pseudo-random, numbers. As we mentioned in Section 1.3, a random 
number generator, or RNG, is a program for generating random numbers. 
Most such programs generate numbers that appear to be drawings from the 
uniform U(0,1) distribution, which can then be transformed into drawings 
from other distributions. There is a large literature on RNGs, to which Press 
et al. (1992a, 1992b, Chapter 7) provides an accessible introduction. See also 
Knuth (1998, Chapter 3) and Gentle (1998). 


Although there are many types of RNG, the most common are variants of the 
linear congruential generator, 


$$z_i = \lambda z_{i-1} + c \;\; [\mathrm{mod}\ m], \qquad \eta_i = \frac{z_i}{m}, \qquad i = 1, 2, \ldots, \qquad (4.62)$$


where $\eta_i$ is the $i^{\text{th}}$ random number generated, and m, $\lambda$, c, and so also the $z_i$, 
are positive integers. The notation [mod m] means that we divide what 
precedes it by m and retain the remainder. This generator starts with a (generally 
large) positive integer $z_0$ called the seed, multiplies it by $\lambda$, and then adds c 
to obtain an integer that may well be bigger than m. It then obtains $z_1$ as 
the remainder from division by m. To generate the next random number, the 
process is repeated with $z_1$ replacing $z_0$, and so on. At each stage, the actual 
random number output by the generator is $z_i/m$, which, since $0 \le z_i < m$, 
lies in the interval [0,1]. For a given generator defined by $\lambda$, m, and c, the 
sequence of random numbers depends entirely on the seed. If we provide the 
generator with the same seed, we will get the same sequence of numbers. 


How well or badly this procedure works depends on how $\lambda$, m, and c are 
chosen. On 32-bit computers, many commonly used generators set c = 0 and 
use for m a prime number that is either a little less than $2^{32}$ or a little less than 
$2^{31}$. When c = 0, the generator is said to be multiplicative congruential. The 
parameter $\lambda$, which will be large but substantially smaller than m, must be 
chosen so as to satisfy some technical conditions. When $\lambda$ and m are chosen 
properly with c = 0, the RNG will have a period of m − 1. This means that 
it will generate every rational number with denominator m between 1/m and 
(m − 1)/m precisely once until, after m − 1 steps, $z_0$ comes up again. After 
that, the generator repeats itself, producing the same m − 1 numbers in the 
same order each time. 
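
A toy version of the generator (4.62) shows the role of the seed and the period. This is only a sketch: the values m = 7 and multipliers 3 and 2 are hypothetical choices made so that the full cycle is visible, and they are far too small for practical use.

```python
def lcg(seed, lam, m, c=0):
    """Yield (z_i, eta_i) pairs from z_i = (lam * z_{i-1} + c) mod m, eta_i = z_i / m."""
    z = seed
    while True:
        z = (lam * z + c) % m
        yield z, z / m

def first_draws(seed, lam, m, count, c=0):
    """Return the first `count` integers z_1, z_2, ... produced from the given seed."""
    gen = lcg(seed, lam, m, c)
    return [next(gen)[0] for _ in range(count)]

# Multiplicative generator (c = 0) with m = 7 and lam = 3: full period m - 1 = 6.
zs = first_draws(seed=1, lam=3, m=7, count=8)
print(zs)                        # [3, 2, 6, 4, 5, 1, 3, 2]
print([z / 7 for z in zs[:6]])   # the corresponding random numbers z_i / m

# A poor multiplier for the same m: lam = 2 repeats after only 3 steps.
short_cycle = first_draws(seed=1, lam=2, m=7, count=6)
print(short_cycle)               # [2, 4, 1, 2, 4, 1]
```

With the multiplier 3, every value from 1/7 to 6/7 appears once before the seed recurs. With the multiplier 2, only three distinct values are ever produced, which is the kind of failure that a badly chosen multiplier causes.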


Unfortunately, many random number generators, whether or not they are of 
the linear congruential variety, perform poorly. The random numbers they 
generate may fail to be independent in all sorts of ways, and the period may 
be relatively short. In the case of multiplicative congruential generators, this 
means that $\lambda$ and m have not been chosen properly. See Gentle (1998) and 
the other references cited above for discussion of bad random number 
generators. Toy examples of multiplicative congruential generators are examined in 
Exercise 4.13, where the choice of $\lambda$ and m is seen to matter. 


There are several ways to generate drawings from a normal distribution if we 
can generate random numbers from the U(0,1) distribution. The simplest, 
but not the fastest, is to use the fact that, if $\eta_i$ is distributed as U(0,1), then 
$\Phi^{-1}(\eta_i)$ is distributed as N(0, 1); this follows from the result of Exercise 4.14. 
Most of the random number generators available in econometrics software 
packages use faster algorithms to generate drawings from the standard normal 
distribution, usually in a way entirely transparent to the user, who merely 
has to ask for so many independent drawings from N(0,1). Drawings from 
$N(\mu, \sigma^2)$ can then be obtained by use of the formula (4.09). 
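
The inverse-CDF method can be sketched with the Python standard library, whose `statistics.NormalDist` class provides `inv_cdf`. The shift and rescaling step below is the usual transformation $\mu + \sigma z$, which is assumed here to be what formula (4.09) expresses. The function name, seed, and parameter values are illustrative.

```python
import random
from statistics import NormalDist

def normal_draws(count, mu=0.0, sigma=1.0, seed=2024):
    """Draw from N(mu, sigma^2) by applying the inverse N(0,1) CDF to U(0,1) draws."""
    rng = random.Random(seed)
    std_normal = NormalDist()            # standard normal, N(0,1)
    draws = []
    while len(draws) < count:
        eta = rng.random()               # uniform on [0, 1)
        if eta == 0.0:                   # inv_cdf is undefined at 0; skip this rare value
            continue
        z = std_normal.inv_cdf(eta)      # a drawing from N(0,1)
        draws.append(mu + sigma * z)     # shift and rescale to N(mu, sigma^2)
    return draws

sample = normal_draws(10000, mu=1.0, sigma=2.0)
sample_mean = sum(sample) / len(sample)
sample_var = sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1)
print(sample_mean, sample_var)
```

The sample mean and variance should be close to 1 and 4, respectively, up to simulation error.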


Bootstrap Tests 


Although pivotal test statistics do arise from time to time, most test statis- 
tics in econometrics are not pivotal. The vast majority of them are, however, 
asymptotically pivotal. If a test statistic has a known asymptotic distribution 
that does not depend on anything unobservable, as do t and F statistics under 
the relatively weak assumptions of Section 4.5, then it is certainly asymptot- 
ically pivotal. Even if it does not follow a known asymptotic distribution, a 
test statistic may be asymptotically pivotal. 


A statistic that is not an exact pivot cannot be used for a Monte Carlo test. 
However, approximate P values for statistics that are only asymptotically 
pivotal, or even nonpivotal, can be obtained by a simulation method called 
the bootstrap. This method can be a valuable alternative to the large sample 
tests based on asymptotic theory that we discussed in the previous section. 
The term bootstrap, which was introduced to statistics by Efron (1979), is 
taken from the phrase “to pull oneself up by one’s own bootstraps.” Although 
the link between this improbable activity and simulated P values is tenuous 
at best, the term is by now firmly established. We will speak of bootstrapping 
in order to obtain bootstrap samples, from which we compute bootstrap test 
statistics that we use to perform bootstrap tests on the basis of bootstrap 
P values, and so on. 


The difference between a Monte Carlo test and a bootstrap test is that for 
the former, the DGP is assumed to be known, whereas, for the latter, it is 
necessary to estimate a bootstrap DGP from which to draw the simulated 
samples. Unless the null hypothesis under test is a simple hypothesis, the 
DGP that generated the original data is unknown, and so it cannot be used 
to generate simulated data. The bootstrap DGP is an estimate of the unknown 
true DGP. The hope is that, if the bootstrap DGP is close, in some sense, 
to the true one, then data generated by the bootstrap DGP will be similar to 
data that would have been generated by the true DGP, if it were known. If 
so, then a simulated P value obtained by use of the bootstrap DGP will be 
close enough to the true P value to allow accurate inference. 


Even for models as simple as the linear regression model, there are many 
ways to specify the bootstrap DGP. The key requirement is that it should 
satisfy the restrictions of the null hypothesis. If this is assured, then how well a 
bootstrap test performs in finite samples depends on how good an estimate the 
bootstrap DGP is of the process that would have generated the test statistic 
if the null hypothesis were true. In the next subsection, we discuss bootstrap 
DGPs for regression models. 


Bootstrap DGPs for Regression Models 


If the null and alternative hypotheses are regression models, the simplest 
approach is to estimate the model that corresponds to the null hypothesis 
and then use the estimates to generate the bootstrap samples, under the 
assumption that the error terms are normally distributed. We considered 
examples of such procedures in Section 1.3 and in Exercise 1.22. 


Since bootstrapping is quite unnecessary in the context of the classical normal 
linear model, we will take for our example a linear regression model with 
normal errors, but with a lagged dependent variable among the regressors: 


$$y_t = X_t \beta + Z_t \gamma + \delta y_{t-1} + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2), \qquad (4.63)$$


where $X_t$ and $\beta$ each have $k_1 - 1$ elements, $Z_t$ and $\gamma$ each have $k_2$ elements, 
and the null hypothesis is that $\gamma = 0$. Thus the model that represents the 
null is 

$$y_t = X_t \beta + \delta y_{t-1} + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2). \qquad (4.64)$$


The observations are assumed to be indexed in such a way that $y_0$ is observed, 
along with n observations on $y_t$, $X_t$, and $Z_t$ for t = 1,...,n. By estimating 
the models (4.63) and (4.64) by OLS, we can compute the F statistic for 
$\gamma = 0$, which we will call $\hat\tau$. Because the regression function contains a lagged 
dependent variable, however, the F test based on $\hat\tau$ will not be exact. 


The model (4.64) is a fully specified parametric model, which means that 
each set of parameter values for $\beta$, $\delta$, and $\sigma^2$ defines just one DGP. The 
simplest type of bootstrap DGP for fully specified models is given by the 
parametric bootstrap. The first step in constructing a parametric bootstrap 
DGP is to estimate (4.64) by OLS, yielding the restricted estimates $\tilde\beta$, $\tilde\delta$, and 
$\tilde s^2 \equiv \mathrm{SSR}(\tilde\beta, \tilde\delta)/(n - k_1)$. Then the bootstrap DGP is given by 


$$y_t^* = X_t \tilde\beta + \tilde\delta y_{t-1}^* + u_t^*, \quad u_t^* \sim \mathrm{NID}(0, \tilde s^2), \qquad (4.65)$$

which is just the element of the model (4.64) characterized by the parameter 
estimates under the null, with stars to indicate that the data are simulated. 


In order to draw a bootstrap sample from the bootstrap DGP (4.65), we first 
draw an n-vector $u^*$ from the $N(0, \tilde s^2 I)$ distribution. The presence of a lagged 
dependent variable implies that the bootstrap samples must be constructed 
recursively. This is necessary because $y_t^*$, the $t^{\text{th}}$ element of the bootstrap 
sample, must depend on $y_{t-1}^*$ and not on $y_{t-1}$ from the original data. The 
recursive rule for generating a bootstrap sample is 


$$\begin{aligned}
y_1^* &= X_1 \tilde\beta + \tilde\delta y_0 + u_1^* \\
y_2^* &= X_2 \tilde\beta + \tilde\delta y_1^* + u_2^* \\
&\;\;\vdots \qquad\qquad\qquad\qquad\qquad (4.66) \\
y_n^* &= X_n \tilde\beta + \tilde\delta y_{n-1}^* + u_n^*.
\end{aligned}$$

Notice that every bootstrap sample is conditional on the observed value of $y_0$. 
There are other ways of dealing with pre-sample values of the dependent 
variable, but this is certainly the most convenient, and it may, in many cir- 
cumstances, be the only method that is feasible. 
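
The recursion (4.66) can be sketched in a few lines. The function and argument names below are hypothetical, and the numerical inputs are made up so that the result can be checked by hand.

```python
def recursive_sample(x_rows, beta, delta, y0, u_star):
    """Build y*_1, ..., y*_n from y*_t = X_t beta + delta * y*_{t-1} + u*_t, starting at y0."""
    y_star = []
    y_prev = y0                          # the first observation conditions on the observed y_0
    for x_t, u_t in zip(x_rows, u_star):
        fitted = sum(x * b for x, b in zip(x_t, beta))
        y_t = fitted + delta * y_prev + u_t
        y_star.append(y_t)
        y_prev = y_t                     # later observations use the simulated lag, not the data
    return y_star

# Constant-only X_t with beta = 1, delta = 0.5, y_0 = 0, and all errors set to zero.
x_rows = [[1.0], [1.0], [1.0]]
path = recursive_sample(x_rows, beta=[1.0], delta=0.5, y0=0.0, u_star=[0.0, 0.0, 0.0])
print(path)   # [1.0, 1.5, 1.75]
```

The key design point is the update of `y_prev` inside the loop: each simulated observation feeds into the next one, so the original data enter only through the initial value.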


The rest of the procedure for computing a bootstrap P value is identical to 
the one for computing a simulated P value for a Monte Carlo test. For each 
of the B bootstrap samples, $y_j^*$, a bootstrap test statistic $\tau_j^*$ is computed 
from $y_j^*$ in just the same way as $\hat\tau$ was computed from the original data, y. 
The bootstrap P value $\hat p^*(\hat\tau)$ is then computed by formula (4.61). 
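
Putting the pieces together, a parametric bootstrap test might look like the sketch below. It makes several simplifying assumptions that are not in the text: $X_t$ is just a constant (so $k_1 = 2$), $Z_t$ is a single regressor (so the F statistic has one restriction), the data are artificial, and B = 199. The helper names are hypothetical. The F statistic is computed from restricted and unrestricted sums of squared residuals, with the constant and the lag partialled out of both $y_t$ and $Z_t$.

```python
import math
import random

def fit_const_and_lag(v, lag):
    """OLS of v on a constant and lag; returns (intercept, slope, residuals)."""
    n = len(v)
    l_bar = sum(lag) / n
    v_bar = sum(v) / n
    sll = sum((l - l_bar) ** 2 for l in lag)
    slope = sum((l - l_bar) * (vi - v_bar) for l, vi in zip(lag, v)) / sll
    intercept = v_bar - slope * l_bar
    resid = [vi - intercept - slope * l for vi, l in zip(v, lag)]
    return intercept, slope, resid

def f_stat(y, z, y0):
    """F statistic for gamma = 0 in y_t = beta + delta*y_{t-1} + gamma*z_t + u_t."""
    n = len(y)
    lag = [y0] + y[:-1]
    _, _, ry = fit_const_and_lag(y, lag)     # restricted residuals
    _, _, rz = fit_const_and_lag(z, lag)     # z with constant and lag partialled out
    rssr = sum(r * r for r in ry)
    ussr = rssr - sum(a * b for a, b in zip(rz, ry)) ** 2 / sum(r * r for r in rz)
    return (rssr - ussr) / (ussr / (n - 3))  # one restriction, three regressors

def parametric_bootstrap_p(y, z, y0, B=199, seed=7):
    """Return (tau_hat, bootstrap P value) using a DGP of the form (4.65)."""
    rng = random.Random(seed)
    n = len(y)
    tau_hat = f_stat(y, z, y0)
    b, d, resid = fit_const_and_lag(y, [y0] + y[:-1])     # restricted estimates
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))    # k_1 = 2 in this sketch
    exceed = 0
    for _ in range(B):
        y_star, prev = [], y0
        for _t in range(n):
            prev = b + d * prev + rng.gauss(0.0, s)       # recursion as in (4.66)
            y_star.append(prev)
        if f_stat(y_star, z, y0) > tau_hat:
            exceed += 1
    return tau_hat, exceed / B

# Artificial data generated under the null (gamma = 0), purely for illustration.
data_rng = random.Random(1)
n_obs = 40
z = [data_rng.gauss(0.0, 1.0) for _ in range(n_obs)]
y, prev = [], 0.0
for _t in range(n_obs):
    prev = 1.0 + 0.5 * prev + data_rng.gauss(0.0, 1.0)
    y.append(prev)

tau_hat, p_star = parametric_bootstrap_p(y, z, y0=0.0)
print(tau_hat, p_star)
```

Each bootstrap statistic is computed by the same function as the original statistic, and the regressor `z` is held fixed across bootstrap samples while the lagged dependent variable is regenerated.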


A Nonparametric Bootstrap DGP 


The parametric bootstrap procedure that we have just described, based on the 
DGP (4.65), does not allow us to relax the strong assumption that the error 
terms are normally distributed. How can we construct a satisfactory bootstrap 
DGP if we extend the models (4.63) and (4.64) to admit nonnormal errors? If 
we knew the true error distribution, whether or not it was normal, we could 
always generate the $u_t^*$ from it. Since we do not know it, we will have to find 
some way to estimate this distribution. 


Under the null hypothesis, the OLS residual vector $\tilde u$ for the restricted model 
is a consistent estimator of the error vector u. This is an immediate 
consequence of the consistency of the OLS estimator itself. In the particular case 
of model (4.64), we have for each t that 


$$\operatorname*{plim}_{n\to\infty} \tilde u_t = \operatorname*{plim}_{n\to\infty} \,(y_t - X_t \tilde\beta - \tilde\delta y_{t-1}) = y_t - X_t \beta_0 - \delta_0 y_{t-1} = u_t,$$


where $\beta_0$ and $\delta_0$ are the parameter values for the true DGP. This means that, 
if the $u_t$ are mutually independent drawings from the error distribution, then 
so are the residuals $\tilde u_t$, asymptotically. 


From the Fundamental Theorem of Statistics, we know that the empirical dis- 
tribution function of the error terms is a consistent estimator of the unknown 
CDF of the error distribution. Because the residuals consistently estimate the 
errors, it follows that the EDF of the residuals is also a consistent estimator 
of the CDF of the error distribution. Thus, if we draw bootstrap error terms 
from the empirical distribution of the residuals, we are drawing them from 
a distribution that tends to the true error distribution as $n \to \infty$. This is 
completely analogous to using estimated parameters in the bootstrap DGP 
that tend to the true parameters as $n \to \infty$. 


Drawing simulated error terms from the empirical distribution of the residuals 
is called resampling. In order to resample the residuals, all the residuals are, 
metaphorically speaking, thrown into a hat and then randomly pulled out one 
at a time, with replacement. Thus each bootstrap sample will contain some 
of the residuals exactly once, some of them more than once, and some of them 
not at all. Therefore, the value of each drawing must be the value of one of 
the residuals, with equal probability for each residual. This is precisely what 
we mean by the empirical distribution of the residuals. 


To resample concretely rather than metaphorically, we can proceed as follows. 
First, we draw a random number $\eta$ from the U(0,1) distribution. Then we 
divide the interval [0,1] into n subintervals of length 1/n and associate each 
of these subintervals with one of the integers between 1 and n. When $\eta$ falls 
into the $l^{\text{th}}$ subinterval, we choose the index l, and our random drawing is the 
$l^{\text{th}}$ residual. Repeating this procedure n times yields a single set of bootstrap 
error terms drawn from the empirical distribution of the residuals. 


As an example of how resampling works, suppose that n = 10, and the ten 
residuals are 


6.45, 1.28, —3.48, 2.44, —5.17, —1.67, —2.03, 3.58, 0.74, —2.14. 


Notice that these numbers sum to zero. Now suppose that, when forming 
one of the bootstrap samples, the ten drawings from the U(0,1) distribution 
happen to be 


0.631, 0.277, 0.745, 0.202, 0.914, 0.136, 0.851, 0.878, 0.120, 0.259. 


This implies that the ten index values will be 


7, 3, 8, 3, 10, 2, 9, 9, 2, 3. 


Therefore, the error terms for this bootstrap sample will be 


−2.03, −3.48, 3.58, −3.48, −2.14, 1.28, 0.74, 0.74, 1.28, −3.48. 


Some of the residuals appear just once in this particular sample, some of them 
(numbers 2, 3, and 9) appear more than once, and some of them (numbers 1, 
4, 5, and 6) do not appear at all. On average, however, each of the residuals 
will appear once in each of the bootstrap samples. 
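
The worked example can be reproduced in code. In this sketch, each uniform draw is mapped to a 1-based index by rounding n times the draw up to the next integer. That convention for the subinterval boundaries is an assumption, although boundary cases occur with probability zero.

```python
import math

residuals = [6.45, 1.28, -3.48, 2.44, -5.17, -1.67, -2.03, 3.58, 0.74, -2.14]
uniforms = [0.631, 0.277, 0.745, 0.202, 0.914, 0.136, 0.851, 0.878, 0.120, 0.259]

def resample_indices(etas, n):
    """Map each U(0,1) draw to the 1-based index of the subinterval of length 1/n containing it."""
    return [max(1, math.ceil(eta * n)) for eta in etas]   # max() guards against a draw of exactly 0

indices = resample_indices(uniforms, len(residuals))
bootstrap_errors = [residuals[i - 1] for i in indices]

print(indices)            # [7, 3, 8, 3, 10, 2, 9, 9, 2, 3]
print(bootstrap_errors)   # [-2.03, -3.48, 3.58, -3.48, -2.14, 1.28, 0.74, 0.74, 1.28, -3.48]
```

In practice, a library routine that samples with replacement gives draws from the same distribution, but the explicit mapping makes clear how a uniform generator is all that is needed.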


If we adopt this resampling procedure, we can write the bootstrap DGP as 

$$y_t^* = X_t \tilde\beta + \tilde\delta y_{t-1}^* + u_t^*, \quad u_t^* \sim \mathrm{EDF}(\tilde u), \qquad (4.67)$$


where $\mathrm{EDF}(\tilde u)$ denotes the distribution that assigns probability 1/n to each 
of the elements of the residual vector $\tilde u$. The DGP (4.67) is one form of what 
is usually called a nonparametric bootstrap, although, since it still uses the 
parameter estimates $\tilde\beta$ and $\tilde\delta$, it should really be called semiparametric rather 
than nonparametric. Once bootstrap error terms have been drawn by 
resampling, bootstrap samples can be created by the recursive procedure (4.66). 


The empirical distribution of the residuals may fail to satisfy some of the 
properties that the null hypothesis imposes on the true error distribution, and 
so the DGP (4.67) may fail to belong to the null hypothesis. One case in which 
this failure has grave consequences arises when the regression (4.64) does not 
contain a constant term, because then the sample mean of the residuals is 
not, in general, equal to 0. The expectation of the EDF of the residuals is 
simply their sample mean; recall Exercise 1.1. Thus, if the bootstrap error 
terms are drawn from a distribution with nonzero mean, the bootstrap DGP 
lies outside the null hypothesis. It is, of course, simple to correct this problem. 
We just need to center the residuals before throwing them into the hat, by 
subtracting their mean $\bar{\tilde u}$. When we do this, the bootstrap errors are drawn 
from $\mathrm{EDF}(\tilde u - \bar{\tilde u}\iota)$, a distribution that does indeed have mean 0. 


A somewhat similar argument gives rise to an improved bootstrap DGP. If
the sample mean of the restricted residuals is 0, then the variance of their
empirical distribution is the second moment $n^{-1}\sum_{t=1}^n \tilde u_t^2$. Thus, by using
the definition (3.49) of $\tilde s^2$ in Section 3.6, we see that the variance of the
empirical distribution of the residuals is $\tilde s^2(n - k_1)/n$. Since we do not know
the value of $\sigma_0^2$, we cannot draw from a distribution with exactly that variance.
However, as with the parametric bootstrap (4.65), we can at least draw from
a distribution with variance $\tilde s^2$. This is easy to do by drawing from the EDF
of the rescaled residuals, which are obtained by multiplying the OLS residuals
by $(n/(n - k_1))^{1/2}$. If we resample these rescaled residuals, the bootstrap error
distribution is

$$\mathrm{EDF}\Bigl(\Bigl(\frac{n}{n - k_1}\Bigr)^{1/2}\tilde u\Bigr), \tag{4.68}$$

which has variance $\tilde s^2$. A somewhat more complicated approach, based on the
result (3.44), is explored in Exercise 4.15.
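The recentering and rescaling steps described above, followed by resampling and the recursive construction of a bootstrap sample, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the toy residuals, and the starting value `y0` are hypothetical, and the regressors enter only through a list of fitted values $X_t\tilde\beta$.

```python
import math
import random

def recenter_and_rescale(residuals, k1):
    """Subtract the sample mean, then multiply by (n/(n - k1))**0.5 as in (4.68)."""
    n = len(residuals)
    mean = sum(residuals) / n
    scale = math.sqrt(n / (n - k1))
    return [scale * (u - mean) for u in residuals]

def draw_bootstrap_errors(adjusted, rng):
    """Draw n errors with replacement; each adjusted residual has probability 1/n."""
    n = len(adjusted)
    return [adjusted[rng.randrange(n)] for _ in range(n)]

def bootstrap_sample(x_beta, delta, y0, errors):
    """Recursive construction y*_t = X_t beta + delta * y*_{t-1} + u*_t.

    x_beta holds the values of X_t beta-tilde; y0 is an assumed starting value.
    """
    y_star = []
    previous = y0
    for fitted, u_star in zip(x_beta, errors):
        previous = fitted + delta * previous + u_star
        y_star.append(previous)
    return y_star

# Toy illustration with made-up residuals (not data from the text).
rng = random.Random(42)
residuals = [1.0, -2.0, 0.5, 1.5]
adjusted = recenter_and_rescale(residuals, k1=1)
errors = draw_bootstrap_errors(adjusted, rng)
print(bootstrap_sample([1.0, 1.0, 1.0, 1.0], 0.5, 2.0, errors))
```

When the residuals already have mean 0, the recentering step changes nothing, and the EDF of the adjusted residuals has variance equal to the sum of squared residuals divided by $n - k_1$.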


Although they may seem strange, these resampling procedures often work 
astonishingly well, except perhaps when the sample size is very small or the 
distribution of the error terms is very unusual; see Exercise 4.18. If the 
distribution of the error terms displays substantial skewness (that is, a nonzero 
third moment) or excess kurtosis (that is, a fourth moment greater than $3\sigma_0^4$),
then there is a good chance that the EDF of the recentered and rescaled 
residuals will do so as well. 


Other methods for bootstrapping regression models nonparametrically and 
semiparametrically are discussed by Efron and Tibshirani (1993), Davison 
and Hinkley (1997), and Horowitz (2001), which also discuss many other 
aspects of the bootstrap. A more advanced book, which deals primarily with 
the relationship between asymptotic theory and the bootstrap, is Hall (1992). 


How Many Bootstraps? 

Suppose that we wish to perform a bootstrap test at level $\alpha$. Then B should
be chosen to satisfy the condition that $\alpha(B + 1)$ is an integer. If $\alpha = .05$, the
values of B that satisfy this condition are 19, 39, 59, and so on. If $\alpha = .01$,
they are 99, 199, 299, and so on. It is illuminating to see why B should be
chosen in this way.


Imagine that we sort the original test statistic $\hat\tau$ and the B bootstrap
statistics $\tau_j^*$, $j = 1, \ldots, B$, in decreasing order. If $\tau$ is pivotal, then, under the
null hypothesis, these are all independent drawings from the same distribution.
Thus the rank r of $\hat\tau$ in the sorted set can have $B + 1$ possible values,
$r = 0, 1, \ldots, B$, all of them equally likely under the null hypothesis if $\tau$ is
pivotal. Here, r is defined in such a way that there are exactly r simulations
for which $\tau_j^* > \hat\tau$. Thus, if $r = 0$, $\hat\tau$ is the largest value in the set, and if $r = B$,
it is the smallest. The estimated P value $\hat p^*(\hat\tau)$ is just $r/B$.


The bootstrap test rejects if $r/B < \alpha$, that is, if $r < \alpha B$. Under the null,
the probability that this inequality will be satisfied is the proportion of the
$B + 1$ possible values of r that satisfy it. If we denote by $[\alpha B]$ the largest
integer that is smaller than $\alpha B$, it is easy to see that there are exactly $[\alpha B] + 1$
such values of r, namely, $0, 1, \ldots, [\alpha B]$. Thus the probability of rejection is
$([\alpha B] + 1)/(B + 1)$. If we equate this probability to $\alpha$, we find that

$$\alpha(B + 1) = [\alpha B] + 1.$$

Since the right-hand side of this equality is the sum of two integers, this
equality can hold only if $\alpha(B + 1)$ is an integer. Moreover, it will hold whenever
$\alpha(B + 1)$ is an integer. Therefore, the Type I error will be precisely $\alpha$ if and
only if $\alpha(B + 1)$ is an integer. Although this reasoning is rigorous only if $\tau$ is
an exact pivot, experience shows that bootstrap P values based on nonpivotal
statistics are less misleading if $\alpha(B + 1)$ is an integer.


As a concrete example, suppose that $\alpha = .05$ and B = 99. Then there are 5
out of 100 values of r, namely, r = 0,1,...,4, that would lead us to reject the 
null hypothesis. Since these are equally likely if the test statistic is pivotal, we 
will make a Type I error precisely 5% of the time, and the test will be exact. 
But suppose instead that B = 89. Since the same 5 values of r would still 
lead us to reject the null, we would now do so with probability 5/90 = 0.0556. 
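The counting argument above is easy to check numerically. The following sketch, which assumes a pivotal statistic so that the B + 1 values of r are equally likely, counts the values of r for which r/B is less than the level and divides by B + 1. The function names are hypothetical, and exact rational arithmetic is used so that comparisons such as r/B < .05 are not affected by rounding.

```python
from fractions import Fraction

def rejection_probability(alpha, B):
    """Probability of rejecting a true null when r is uniform on 0, 1, ..., B."""
    rejecting = sum(1 for r in range(B + 1) if Fraction(r, B) < alpha)
    return Fraction(rejecting, B + 1)

def is_exact(alpha, B):
    """True when alpha * (B + 1) is an integer, the condition derived above."""
    return (alpha * (B + 1)).denominator == 1

alpha = Fraction(1, 20)  # the .05 level
for B in (99, 89, 19, 100):
    print(B, rejection_probability(alpha, B), is_exact(alpha, B))
```

For B = 99 the function returns 1/20, and for B = 89 it returns 1/18, which is the value 5/90 given in the text.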


It is important that B be sufficiently large, since two problems can arise 
if it is not. The first problem is that the outcome of the test will depend 
on the sequence of random numbers used to generate the bootstrap samples. 
Different investigators may therefore obtain different results, even though they 
are using the same data and testing the same hypothesis. The second problem, 
which we will discuss in the next section, is that the ability of a bootstrap test 
to reject a false null hypothesis declines as B becomes smaller. As a rule of 
thumb, we suggest choosing B = 999. If calculating the $\tau_j^*$ is inexpensive and
the outcome of the test is at all ambiguous, it may be desirable to use a larger
value, like 9999. On the other hand, if calculating the $\tau_j^*$ is very expensive
and the outcome of the test is unambiguous, because $\hat p^*$ is far from $\alpha$, it may
be safe to use a value as small as 99.


It is not actually necessary to choose B in advance. An alternative approach,
which is a bit more complicated but can save a lot of computer time, has
been proposed by Davidson and MacKinnon (2000). The idea is to calculate
a sequence of estimated P values, based on increasing values of B, and to
stop as soon as the estimate $\hat p^*$ allows us to be very confident that $p^*$ is either
greater or less than $\alpha$. For example, we might start with B = 99, then perform
an additional 100 simulations if we cannot be sure whether or not to reject the 
null hypothesis, then perform an additional 200 simulations if we still cannot 
be sure, and so on. Eventually, we either stop when we are confident that the 
null hypothesis should or should not be rejected, or when B has become so 
large that we cannot afford to continue. 
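The general idea of such a sequential procedure can be sketched as follows. This is not the stopping rule of Davidson and MacKinnon (2000); as a simple stand-in, the sketch stops when the estimated P value differs from the level by more than `z` binomial standard errors computed under the hypothesis that the ideal P value equals the level. The function name, the argument `draw_tau_star` (a user-supplied function returning one bootstrap statistic), and the defaults for `z` and `max_B` are all assumptions made for illustration.

```python
import math
import random

def sequential_bootstrap_test(tau_hat, draw_tau_star, alpha=0.05,
                              start_B=99, max_B=12799, z=3.0):
    """Return (estimated P value, B used), enlarging B until the outcome is clear."""
    stats = []
    B = start_B
    while True:
        # Add simulations until we have B bootstrap statistics in total.
        while len(stats) < B:
            stats.append(draw_tau_star())
        r = sum(1 for t in stats if t > tau_hat)
        p_hat = r / B
        # Standard error of p_hat if the ideal P value were exactly alpha.
        se = math.sqrt(alpha * (1 - alpha) / B)
        if abs(p_hat - alpha) > z * se or B >= max_B:
            return p_hat, B
        # 99 -> 199 -> 399 -> ..., so alpha * (B + 1) stays an integer for alpha = .05.
        B = 2 * (B + 1) - 1

# Toy illustration: bootstrap statistics drawn from N(0, 1).
rng = random.Random(0)
print(sequential_bootstrap_test(1.0, lambda: rng.gauss(0.0, 1.0)))
```

Doubling B + 1 at each step reproduces the pattern in the text of 100 additional simulations, then 200, and so on.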


Bootstrap versus Asymptotic Tests 


Although bootstrap tests based on test statistics that are merely asymptotic- 
ally pivotal are not exact, there are strong theoretical reasons to believe that 
they will generally perform better than tests based on approximate asymp- 
totic distributions. The errors committed by both asymptotic and bootstrap 
tests diminish as n increases, but those committed by bootstrap tests dimin- 
ish more rapidly. The fundamental theoretical result on this point is due to 
Beran (1988). The results of a number of Monte Carlo experiments have pro- 
vided strong support for this proposition. References include Horowitz (1994), 
Godfrey (1998), and Davidson and MacKinnon (1999a, 1999b, 2002a). 


We can illustrate this by means of an example. Consider the following simple
special case of the linear regression model (4.63):

$$y_t = \beta_1 + \beta_2 X_t + \beta_3 y_{t-1} + u_t, \qquad u_t \sim N(0, \sigma^2), \tag{4.69}$$

where the null hypothesis is that $\beta_3 = 0.9$. A Monte Carlo experiment to
investigate the properties of tests of this hypothesis would work as follows.
First, we fix a DGP in the model (4.69) by choosing values for the parameters.
Here $\beta_3 = 0.9$, and so we investigate only what happens under the null
hypothesis. For each replication, we generate an artificial data set from our chosen
DGP and compute the ordinary t statistic for $\beta_3 = 0.9$. We then compute
three P values. The first of these, for the asymptotic test, is computed using
the Student's t distribution with $n - 3$ degrees of freedom, and the other two
are bootstrap P values from the parametric and semiparametric bootstraps,
with residuals rescaled using (4.68), for B = 199.⁵ We perform many
replications and record the frequencies with which tests based on the three P values
reject at the .05 level. Figure 4.8 shows the rejection frequencies based on
500,000 replications for each of 31 sample sizes: n = 10, 12, 14, ..., 60.


The results of this experiment are striking. The asymptotic test overrejects
quite noticeably, although it gradually improves as n increases. In contrast,
the two bootstrap tests overreject only very slightly. Their rejection
frequencies are always very close to the nominal level of .05, and they approach that
level quite quickly as n increases. For the very smallest sample sizes, the
parametric bootstrap seems to outperform the semiparametric one, but, for
most sample sizes, there is nothing to choose between them.


5 We used B = 199, a smaller value than we would ever recommend using in
practice, in order to reduce the costs of doing the Monte Carlo experiments.
Because experimental errors tend to cancel out across replications, this does
not materially affect the results of the experiments.


Figure 4.8 Rejection frequencies for bootstrap and asymptotic tests


This example is, perhaps, misleading in one respect. For linear regression 
models, asymptotic t and F tests generally do not perform as badly as the 
asymptotic t test does here. For example, the t test for $\beta_3 = 0$ in (4.69)
performs much better than the t test for $\beta_3 = 0.9$; it actually underrejects
moderately in small samples. However, the example is not at all misleading in 
suggesting that bootstrap tests will often perform extraordinarily well, even 
when the corresponding asymptotic test does not perform well at all. 


4.7 The Power of Hypothesis Tests 


To be useful, hypothesis tests must be able to discriminate between the null 
hypothesis and the alternative. Thus, as we saw in Section 4.2, the distribu- 
tion of a useful test statistic under the null is different from its distribution 
when the DGP does not belong to the null. Whenever a DGP places most of 
the probability mass of the test statistic in the rejection region of a test, the 
test will have high power, that is, a high probability of rejecting the null. 


For a variety of reasons, it is important to know something about the power
of the tests we employ. If a test with high power fails to reject the null, this
tells us more than if a test with lower power fails to do so. In practice, more
than one test of a given null hypothesis is usually available. Of two equally 
reliable tests, if one has more power than the other against the alternatives 
in which we are interested, then we would surely prefer to employ the more 
powerful one. 


The Power of Exact Tests 


In Section 4.4, we saw that an F statistic is a ratio of the squared norms of two
vectors, each divided by its appropriate number of degrees of freedom. In the
notation of that section, these vectors are, for the numerator, $P_{M_1X_2}y$, and,
for the denominator, $M_X y$. If the null and alternative hypotheses are classical
normal linear models, as we assume throughout this subsection, then, under
the null, both the numerator and the denominator of this ratio are independent
$\chi^2$ variables, divided by their respective degrees of freedom; recall (4.34).
Under the alternative hypothesis, the distribution of the denominator is
unchanged, because, under either hypothesis, $M_X y = M_X u$. Consequently, the
difference in distribution under the null and the alternative that gives the test
its power must come from the numerator alone.


From (4.33), $r/\sigma^2$ times the numerator of the F statistic $F_{\beta_2}$ is

$$\frac{1}{\sigma^2}\, y^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 y. \tag{4.70}$$

The vector $X_2^\top M_1 y$ is normal under both the null and the alternative. Its
mean is $X_2^\top M_1 X_2\beta_2$, which vanishes under the null when $\beta_2 = 0$, and its
covariance matrix is $\sigma^2 X_2^\top M_1 X_2$. We can use these facts to determine the
distribution of the quadratic form (4.70). To do so, we must introduce the
noncentral chi-squared distribution, which is a generalization of the ordinary,
or central, chi-squared distribution.


We saw in Section 4.3 that, if the m-vector z is distributed as $N(0, I)$, then
$\|z\|^2 = z^\top z$ is distributed as (central) chi-squared with m degrees of freedom.
Similarly, if $x \sim N(0, \Omega)$, then $x^\top\Omega^{-1}x \sim \chi^2(m)$. If instead $z \sim N(\mu, I)$,
then $z^\top z$ follows the noncentral chi-squared distribution with m degrees of
freedom and noncentrality parameter, or NCP, $\Lambda \equiv \mu^\top\mu$. This distribution
is written as $\chi^2(m, \Lambda)$. It is easy to see that its expectation is $m + \Lambda$; see
Exercise 4.17. Likewise, if $x \sim N(\mu, \Omega)$, then $x^\top\Omega^{-1}x \sim \chi^2(m, \mu^\top\Omega^{-1}\mu)$.
Although we will not prove it, the distribution depends on $\mu$ and $\Omega$ only
through the quadratic form $\mu^\top\Omega^{-1}\mu$. If we set $\mu = 0$, we see that the $\chi^2(m, 0)$
distribution is just the central $\chi^2(m)$ distribution.
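A quick simulation can make the statement about the expectation concrete without proving it. The sketch below draws vectors with independent normal elements of unit variance and mean vector `mu`, and averages the squared norms; the function names, the seed, and the number of replications are arbitrary choices made for illustration.

```python
import random

def noncentral_chi2_draw(mu, rng):
    """One drawing of z'z, where z ~ N(mu, I)."""
    return sum((m_i + rng.gauss(0.0, 1.0)) ** 2 for m_i in mu)

def simulated_mean(mu, reps=20000, seed=12345):
    """Average of reps drawings of z'z."""
    rng = random.Random(seed)
    return sum(noncentral_chi2_draw(mu, rng) for _ in range(reps)) / reps

mu = [1.0, 2.0, 0.0]               # m = 3 and NCP = 1 + 4 + 0 = 5
theoretical = len(mu) + sum(m_i ** 2 for m_i in mu)
print(theoretical, simulated_mean(mu))
```

The simulated average should be close to the theoretical value $m + \Lambda = 8$, up to experimental error.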


Under either the null or the alternative hypothesis, therefore, the distribution
of expression (4.70) is noncentral chi-squared, with r degrees of freedom, and
with noncentrality parameter given by

$$\Lambda \equiv \frac{1}{\sigma^2}\,\beta_2^\top X_2^\top M_1 X_2 (X_2^\top M_1 X_2)^{-1} X_2^\top M_1 X_2\beta_2 = \frac{1}{\sigma^2}\,\beta_2^\top X_2^\top M_1 X_2\beta_2.$$


Figure 4.9 Densities of noncentral $\chi^2$ distributions


Under the null, $\Lambda = 0$. Under either hypothesis, the distribution of the
denominator of the F statistic, divided by $\sigma^2$, is central chi-squared with $n - k$
degrees of freedom, and it is independent of the numerator. The F statistic
therefore has a distribution that we can write as

$$\frac{\chi^2(r, \Lambda)/r}{\chi^2(n - k)/(n - k)},$$

with numerator and denominator mutually independent. This distribution is
called the noncentral F distribution, with r and $n - k$ degrees of freedom and
noncentrality parameter $\Lambda$. In any given testing situation, r and $n - k$ are
given, and so the difference between the distributions of the F statistic under
the null and under the alternative depends only on the NCP $\Lambda$.


To illustrate this, we limit our attention to the expression (4.70), which is
distributed as $\chi^2(r, \Lambda)$. As $\Lambda$ increases, the distribution moves to the right
and becomes more spread out. This is illustrated in Figure 4.9, which shows
the density of the noncentral $\chi^2$ distribution with 3 degrees of freedom for
noncentrality parameters of 0, 2, 5, 10, and 20. The .05 critical value for the
central $\chi^2(3)$ distribution, which is 7.81, is also shown. If a test statistic has
the noncentral $\chi^2(3)$ distribution, the probability that the null hypothesis will
be rejected at the .05 level is the probability mass to the right of 7.81. It is
evident from the figure that this probability will be small for small values of
the NCP and large for large ones.
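The probability mass to the right of 7.81 can be approximated by simulation for each of the noncentrality parameters just mentioned. The sketch below is a rough Monte Carlo illustration, not an exact calculation; the function name, the seed, and the number of replications are arbitrary. Because the distribution depends on the mean vector only through the NCP, the whole NCP is placed in the first element of the mean.

```python
import math
import random

CRITICAL_VALUE = 7.81  # .05 critical value of the central chi-squared(3), from the text

def rejection_frequency(ncp, df=3, reps=20000, seed=2024):
    """Estimate the probability that a chi-squared(df, ncp) variable exceeds 7.81."""
    rng = random.Random(seed)
    # Mean vector with squared norm equal to the NCP.
    mu = [math.sqrt(ncp)] + [0.0] * (df - 1)
    rejections = 0
    for _ in range(reps):
        draw = sum((m + rng.gauss(0.0, 1.0)) ** 2 for m in mu)
        if draw > CRITICAL_VALUE:
            rejections += 1
    return rejections / reps

for ncp in (0, 2, 5, 10, 20):
    print(ncp, round(rejection_frequency(ncp), 3))
```

With an NCP of 0 the estimated frequency should be close to .05, and it should rise steadily as the NCP increases.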


In Figure 4.9, the number of degrees of freedom r is held constant as $\Lambda$ is
increased. If, instead, we held $\Lambda$ constant, the density functions would move
to the right as r was increased, as they do in Figure 4.4 for the special case
with $\Lambda = 0$. Thus, at any given level, the critical value of a $\chi^2$ or F test will
increase as r increases. It has been shown by Das Gupta and Perlman (1974)
that this rightward shift of the critical value has a greater effect than the
rightward shift of the density for any positive $\Lambda$. Specifically, Das Gupta and
Perlman show that, for a given NCP, the power of a $\chi^2$ or F test at any given
level is strictly decreasing in r, as well as being strictly increasing in $\Lambda$, as we
indicated in the previous paragraph.


The square of a t statistic for a single restriction is just the F test for that
restriction, and so the above analysis applies equally well to t tests. Things
can be made a little simpler, however. From (4.25), the t statistic $t_{\beta_2}$ is $1/s$
times

$$\frac{x_2^\top M_1 y}{(x_2^\top M_1 x_2)^{1/2}}. \tag{4.71}$$

The numerator of this expression, $x_2^\top M_1 y$, is normally distributed under both
the null and the alternative, with variance $\sigma^2 x_2^\top M_1 x_2$ and mean $x_2^\top M_1 x_2\beta_2$.
Thus $1/\sigma$ times (4.71) is normal with variance 1 and mean

$$\lambda \equiv \frac{1}{\sigma}\,(x_2^\top M_1 x_2)^{1/2}\beta_2. \tag{4.72}$$

It follows that $t_{\beta_2}$ has a distribution which can be written as

$$\frac{N(\lambda, 1)}{\bigl(\chi^2(n - k)/(n - k)\bigr)^{1/2}},$$

with independent numerator and denominator. This distribution is known as
the noncentral t distribution, with $n - k$ degrees of freedom and noncentrality
parameter $\lambda$; it is written as $t(n - k, \lambda)$. Note that $\lambda^2 = \Lambda$, where $\Lambda$ is
the NCP of the corresponding F test. Except for very small sample sizes,
the $t(n - k, \lambda)$ distribution is quite similar to the $N(\lambda, 1)$ distribution. It
is also very much like an ordinary, or central, t distribution with its mean
shifted from the origin to (4.72), but it has a bit more variance, because of
the stochastic denominator.


When we know the distribution of a test statistic under the alternative
hypothesis, we can determine the power of a test of given level as a function of
the parameters of that hypothesis. This function is called the power function
of the test. The distribution of $t_{\beta_2}$ under the alternative depends only on the
NCP $\lambda$. For a given regressor matrix X and sample size n, $\lambda$ in turn depends
on the parameters only through the ratio $\beta_2/\sigma$; see (4.72). Therefore, the
power of the t test depends only on this ratio. According to assumption (4.49),
as $n \to \infty$, $n^{-1}X^\top X$ tends to a nonstochastic limiting matrix $S_{X^\top X}$. Thus,
as n increases, the factor $(x_2^\top M_1 x_2)^{1/2}$ will be roughly proportional to $n^{1/2}$,
and so $\lambda$ will tend to infinity with n at a rate similar to that of $n^{1/2}$.


Figure 4.10 Power functions for t tests at the .05 level


Figure 4.10 shows power functions for a very simple model, in which $x_2$, the
only regressor, is a constant. Power is plotted as a function of $\beta_2/\sigma$ for three
sample sizes: n = 25, n = 100, and n = 400. Since the test is exact, all
the power functions are equal to .05 when $\beta = 0$. Power then increases as $\beta$
moves away from 0. As we would expect, the power when n = 400 exceeds
the power when n = 100, which in turn exceeds the power when n = 25, for
every value of $\beta \neq 0$. It is clear that, as $n \to \infty$, the power function will
converge to the shape of a T, with the foot of the vertical segment at .05 and
the horizontal segment at 1.0. Thus, asymptotically, the test will reject the
null with probability 1 whenever it is false. In finite samples, however, we can
see from the figure that a false hypothesis is very unlikely to be rejected if
$n^{1/2}\beta/\sigma$ is sufficiently small.
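Curves of this general shape can be approximated with very little code. The sketch below is an approximation rather than the exact calculation behind the figure: it replaces the noncentral t distribution by the N(λ, 1) distribution, which the text describes as quite similar except for very small samples, and it uses the standard normal critical value. For a model whose only regressor is a constant equal to 1, the NCP is λ = n^{1/2}β/σ. The function name is hypothetical, and the approximation will tend to overstate power somewhat when n is small.

```python
from statistics import NormalDist

def approx_power(beta_over_sigma, n, alpha=0.05):
    """Approximate power of a two-tailed test at level alpha, treating the
    statistic as N(lambda, 1) with lambda = n**0.5 * beta / sigma."""
    std_normal = NormalDist()
    c = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
    lam = n ** 0.5 * beta_over_sigma
    # Probability of falling in the upper tail plus probability of the lower tail.
    return std_normal.cdf(lam - c) + std_normal.cdf(-c - lam)

for n in (25, 100, 400):
    print(n, [round(approx_power(b, n), 3) for b in (0.0, 0.1, 0.2, 0.4)])
```

At β = 0 the function returns the level of the test for every n, and for any fixed nonzero ratio the approximate power increases with the sample size.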


The Power of Bootstrap Tests 


As we remarked in Section 4.6, the power of a bootstrap test depends on B,
the number of bootstrap samples. The reason why it does so is illuminating.
If, to any test statistic, we add random noise independent of the statistic, we
inevitably reduce the power of tests based on that statistic. The bootstrap
P value $\hat p^*(\hat\tau)$ defined in (4.61) is simply an estimate of the ideal bootstrap
P value

$$p^*(\hat\tau) \equiv \Pr(\tau > \hat\tau) = \operatorname*{plim}_{B\to\infty} \hat p^*(\hat\tau),$$

where $\Pr(\tau > \hat\tau)$ is evaluated under the bootstrap DGP. When B is finite, $\hat p^*$
will differ from $p^*$ because of random variation in the bootstrap samples. This
random variation is generated in the computer, and is therefore completely
independent of the random variable $\tau$. The bootstrap testing procedure
discussed in Section 4.6 incorporates this random variation, and in so doing it
reduces the power of the test.


Figure 4.11 Power functions for tests at the .05 level


Another example of how randomness affects test power is provided by the
tests $z_{\beta_2}$ and $t_{\beta_2}$, which were discussed in Section 4.4. Recall that $z_{\beta_2}$ follows
the $N(0, 1)$ distribution, because $\sigma$ is known, and $t_{\beta_2}$ follows the $t(n - k)$
distribution, because $\sigma$ has to be estimated. As equation (4.26) shows, $t_{\beta_2}$ is
equal to $z_{\beta_2}$ times the random variable $\sigma/s$, which has the same distribution
under the null and alternative hypotheses, and is independent of $z_{\beta_2}$.
Therefore, multiplying $z_{\beta_2}$ by $\sigma/s$ simply adds independent random noise to the
test statistic. This additional randomness requires us to use a larger critical
value, and that in turn causes the test based on $t_{\beta_2}$ to be less powerful than
the test based on $z_{\beta_2}$.


Both types of power loss are illustrated in Figure 4.11. It shows power
functions for four tests at the .05 level of the null hypothesis that $\beta = 0$ in the
model (4.01) with normally distributed error terms and 10 observations. All
four tests are exact, as can be seen from the fact that, in all cases, power
equals .05 when $\beta = 0$. For all values of $\beta \neq 0$, there is a clear ordering of
the four curves in Figure 4.11. The highest curve is for the test based on $z_{\beta_2}$,
which uses the $N(0, 1)$ distribution and is available only when $\sigma$ is known.
The next three curves are for tests based on $t_{\beta_2}$. The loss of power from using
$t_{\beta_2}$ with the $t(9)$ distribution, instead of $z_{\beta_2}$ with the $N(0, 1)$ distribution, is
quite noticeable. Of course, 10 is a very small sample size; the loss of power
from not knowing $\sigma$ would be very much less for more reasonable sample sizes.
There is a further loss of power from using a bootstrap test with finite B. This
further loss is quite modest when B = 99, but it is substantial when B = 19.


Figure 4.11 suggests that the loss of power from using bootstrap tests is gen- 
erally modest, except when B is very small. However, readers should be 
warned that the loss can be more substantial in other cases. A reasonable 
rule of thumb is that power loss will very rarely be a problem when B = 999, 
and that it will never be a problem when B = 9999. 


4.8 Final Remarks 


This chapter has introduced a number of important concepts, which we will 
encounter again and again throughout this book. In particular, we will en- 
counter many types of hypothesis test, sometimes exact but more commonly 
asymptotic. Some of the asymptotic tests work well in finite samples, but 
others do not. Many of them can easily be bootstrapped, and they will per- 
form much better when bootstrapped, but others are difficult to bootstrap or 
do not perform particularly well. 


Although hypothesis testing plays a central role in classical econometrics, it 
is not the only method by which econometricians attempt to make inferences 
from parameter estimates about the true values of parameters. In the next 
chapter, we turn our attention to the other principal method, namely, the 
construction of confidence intervals and confidence regions. 


4.9 Exercises 


4.1 Suppose that the random variable z follows the $N(0, 1)$ density. If z is a
test statistic used in a two-tailed test, the corresponding P value, according
to (4.07), is $p(z) \equiv 2(1 - \Phi(|z|))$. Show that $F_p(\cdot)$, the CDF of $p(z)$, is the
CDF of the uniform distribution on [0, 1]. In other words, show that

$$F_p(x) = x \quad\text{for all } x \in [0, 1].$$


4.2 Extend Exercise 1.6 to show that the third and fourth moments of the
standard normal distribution are 0 and 3, respectively. Use these results in order
to calculate the centered and uncentered third and fourth moments of the
$N(\mu, \sigma^2)$ distribution.


4.3 Let the density of the random variable x be $f(x)$. Show that the density of
the random variable $w \equiv tx$, where $t > 0$, is $t^{-1}f(w/t)$. Next let the joint
density of the set of random variables $x_i$, $i = 1, \ldots, m$, be $f(x_1, \ldots, x_m)$. For
$i = 1, \ldots, m$, let $w_i = t_i x_i$, $t_i > 0$. Show that the joint density of the $w_i$ is

$$f(w_1, \ldots, w_m) = \frac{1}{\prod_{i=1}^m t_i}\, f\Bigl(\frac{w_1}{t_1}, \ldots, \frac{w_m}{t_m}\Bigr).$$


4.4 Consider the random variables $x_1$ and $x_2$, which are bivariate normal with
$x_1 \sim N(0, \sigma_1^2)$, $x_2 \sim N(0, \sigma_2^2)$, and correlation $\rho$. Show that the expectation
of $x_1$ conditional on $x_2$ is $\rho(\sigma_1/\sigma_2)x_2$ and that the variance of $x_1$ conditional
on $x_2$ is $\sigma_1^2(1 - \rho^2)$. How are these results modified if the means of $x_1$ and $x_2$
are $\mu_1$ and $\mu_2$, respectively?


4.5 Suppose that, as in the previous question, the random variables $x_1$ and $x_2$
are bivariate normal, with means 0, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$.
Starting from (4.13), show that $f(x_1, x_2)$, the joint density of $x_1$ and $x_2$, is
given by

$$\frac{1}{2\pi}\,\frac{1}{(1 - \rho^2)^{1/2}\sigma_1\sigma_2}\exp\Bigl(\frac{-1}{2(1 - \rho^2)}\Bigl(\frac{x_1^2}{\sigma_1^2} - 2\rho\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\Bigr)\Bigr).$$

Then use this result to show that $x_1$ and $x_2$ are statistically independent
if $\rho = 0$.


4.6 Consider the linear regression model

$$y_t = \beta_1 + \beta_2 X_{t1} + \beta_3 X_{t2} + u_t.$$

Rewrite this model so that the restriction $\beta_2 - \beta_3 = 1$ becomes a single zero
restriction.


4.7 Consider the linear regression model $y = X\beta + u$, where there are n
observations and k regressors. Suppose that this model is potentially subject to r
restrictions which can be written as $R\beta = r$, where R is an $r \times k$ matrix and
r is an r-vector. Rewrite the model so that the restrictions become r zero
restrictions.


4.8 Show that the t statistic (4.25) is $(n - k)^{1/2}$ times the cotangent of the angle
between the n-vectors $M_1 y$ and $M_1 x_2$.

Now consider the regressions

$$y = X_1\beta_1 + \beta_2 x_2 + u, \text{ and}$$
$$x_2 = X_1\gamma_1 + \gamma_2 y + v. \tag{4.73}$$

What is the relationship between the t statistic for $\beta_2 = 0$ in the first of these
regressions and the t statistic for $\gamma_2 = 0$ in the second?


4.9 Show that the OLS estimates $\tilde\beta_1$ from the model (4.29) can be obtained from
those of model (4.28) by the formula

$$\tilde\beta_1 = \hat\beta_1 + (X_1^\top X_1)^{-1}X_1^\top X_2\hat\beta_2.$$

Formula (4.38) is useful for this exercise.


4.10 Show that the SSR from regression (4.42), or equivalently, regression (4.41),
is equal to the sum of the SSRs from the two subsample regressions:

$$y_1 = X_1\beta_1 + u_1, \quad u_1 \sim N(0, \sigma^2 I), \text{ and}$$
$$y_2 = X_2\beta_2 + u_2, \quad u_2 \sim N(0, \sigma^2 I).$$


4.11 When performing a Chow test, one may find that one of the subsamples is
smaller than k, the number of regressors. Without loss of generality, assume
that $n_2 < k$. Show that, in this case, the F statistic becomes

$$\frac{(\mathrm{RSSR} - \mathrm{SSR}_1)/n_2}{\mathrm{SSR}_1/(n_1 - k)},$$

and that the numerator and denominator really have the degrees of freedom
used in this formula.


4.12 Show, using the results of Section 4.5, that r times the F statistic (4.58) is
asymptotically distributed as $\chi^2(r)$.


4.13 Consider a multiplicative congruential generator with modulus m = 7, and
with all reasonable possible values of $\lambda$, that is, $\lambda = 2, 3, 4, 5, 6$. Show that,
for any integer seed between 1 and 6, the generator generates each number of
the form $i/7$, $i = 1, \ldots, 6$, exactly once before cycling for $\lambda = 3$ and $\lambda = 5$,
but that it repeats itself more quickly for the other choices of $\lambda$. Repeat the
exercise for m = 11, and determine which choices of $\lambda$ yield generators that
return to their starting point before covering the full range of possibilities.


4.14 If F is a strictly increasing CDF defined on an interval [a, b] of the real line,
where either or both of a and b may be infinite, then the inverse function $F^{-1}$
is a well-defined mapping from [0, 1] on to [a, b]. Show that, if the random
variable X is a drawing from the U(0, 1) distribution, then $F^{-1}(X)$ is a
drawing from the distribution of which F is the CDF.


4.15 In Section 3.6, we saw that $\mathrm{Var}(\hat u_t) = (1 - h_t)\sigma_0^2$, where $\hat u_t$ is the $t^{\rm th}$ residual
from the linear regression model $y = X\beta + u$, and $h_t$ is the $t^{\rm th}$ diagonal
element of the "hat matrix" $P_X$; this was the result (3.44). Use this result to
derive an alternative to (4.68) as a method of rescaling the residuals prior to
resampling. Remember that the rescaled residuals must have mean 0.


4.16 Suppose that z is a test statistic distributed as $N(0, 1)$ under the null
hypothesis, and as $N(\lambda, 1)$ under the alternative, where $\lambda$ depends on the DGP
that generates the data. If $c_\alpha$ is defined by (4.06), show that the power of
the two-tailed test at level $\alpha$ based on z is equal to

$$\Phi(\lambda - c_\alpha) + \Phi(-c_\alpha - \lambda).$$

Plot this power function for $\lambda$ in the interval [-5, 5] for $\alpha = .05$ and $\alpha = .01$.


4.17 Show that, if the m-vector $z \sim N(\mu, I)$, the expectation of the noncentral
chi-squared variable $z^\top z$ is $m + \mu^\top\mu$.


4.18 The file classical.data contains 50 observations on three variables: $y$, $x_2$,
and $x_3$. These are artificial data generated from the classical linear regression
model

$$y = \beta_1\iota + \beta_2 x_2 + \beta_3 x_3 + u, \quad u \sim N(0, \sigma^2 I).$$

Compute a t statistic for the null hypothesis that $\beta_3 = 0$. On the basis
of this test statistic, perform an exact test. Then perform parametric and
semiparametric bootstrap tests using 99, 999, and 9999 simulations. How do
the two types of bootstrap P values correspond with the exact P value? How
does this correspondence change as B increases?


4.19 Consider again the data in the file consumption.data and the ADL model
studied in Exercise 3.22, which is reproduced here for convenience:

$$c_t = \alpha + \beta c_{t-1} + \gamma_0 y_t + \gamma_1 y_{t-1} + u_t. \tag{3.70}$$

Compute a t statistic for the hypothesis that $\gamma_0 + \gamma_1 = 0$. On the basis of this
test statistic, perform an asymptotic test, a parametric bootstrap test, and a
semiparametric bootstrap test using residuals rescaled according to (4.68).


Chapter 5


Confidence Intervals


5.1 Introduction 


Hypothesis testing, which we discussed in the previous chapter, is the foun- 
dation for all inference in classical econometrics. It can be used to find out 
whether restrictions imposed by economic theory are compatible with the 
data, and whether various aspects of the specification of a model appear to 
be correct. However, once we are confident that a model is correctly speci- 
fied and incorporates whatever restrictions are appropriate, we often want to 
make inferences about the values of some of the parameters that appear in 
the model. Although this can be done by performing a battery of hypothesis 
tests, it is usually more convenient to construct confidence intervals for the 
individual parameters of specific interest. A less frequently used, but some- 
times more informative, approach is to construct confidence regions for two 
or more parameters jointly. 


In order to construct a confidence interval, we need a suitable family of tests
for a set of point null hypotheses. A different test statistic must be calculated
for each different null hypothesis that we consider, but usually there is just
one type of statistic that can be used to test all the different null hypotheses.
For instance, if we wish to test the hypothesis that a scalar parameter $\theta$ in a
regression model equals 0, we can use a t test. But we can also use a t test
for the hypothesis that $\theta = \theta_0$ for any specified real number $\theta_0$. Thus, in this
case, we have a family of t statistics indexed by $\theta_0$.


Given a family of tests capable of testing a set of hypotheses about a (scalar)
parameter $\theta$ of a model, all with the same level $\alpha$, we can use them to construct
a confidence interval for the parameter. By definition, a confidence interval is
an interval of the real line that contains all values $\theta_0$ for which the hypothesis
that $\theta = \theta_0$ is not rejected by the appropriate test in the family. For level $\alpha$,
a confidence interval so obtained is said to be a $1 - \alpha$ confidence interval, or
to be at confidence level $1 - \alpha$. In applied work, .95 confidence intervals are
particularly popular, followed by .99 and .90 ones.
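The definition of a confidence interval as the set of values that are not rejected can be illustrated by brute force. The sketch below assumes, purely for illustration, a family of tests based on the statistic (estimate minus hypothesized value) divided by a standard error, with the standard normal critical value standing in for whatever distribution is appropriate in a given application. It scans a grid of hypothesized values and keeps those that are not rejected. The function names, the grid settings, and the numbers 1.5 and 0.2 are hypothetical.

```python
from statistics import NormalDist

def rejects(theta_hat, se, theta0, alpha=0.05):
    """Two-tailed test of theta = theta0 using a standard normal critical value."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return abs((theta_hat - theta0) / se) > c

def interval_by_inversion(theta_hat, se, alpha=0.05, half_width=10.0, steps=20001):
    """Scan a grid of theta0 values and return the range of those not rejected."""
    grid = [theta_hat + half_width * se * (2 * i / (steps - 1) - 1)
            for i in range(steps)]
    kept = [t0 for t0 in grid if not rejects(theta_hat, se, t0, alpha)]
    return min(kept), max(kept)

theta_hat, se = 1.5, 0.2                    # made-up estimate and standard error
c = NormalDist().inv_cdf(0.975)
print(interval_by_inversion(theta_hat, se))
print((theta_hat - c * se, theta_hat + c * se))   # closed form for comparison
```

Up to the resolution of the grid, the scanned interval coincides with the estimate plus or minus the critical value times the standard error, which is what inverting this particular family of tests gives analytically.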


Unlike the parameters we are trying to make inferences about, confidence
intervals are random. Every different sample that we draw from the same DGP
will yield a different confidence interval. The probability that the random
interval will include, or cover, the true value of the parameter is called the
coverage probability, or just the coverage, of the interval. Suppose that all the
tests in the family have exactly level $\alpha$, that is, they reject their corresponding
null hypotheses with probability exactly equal to $\alpha$ when the hypothesis is
true. Then the coverage of the interval constructed from this family of tests
will be precisely $1 - \alpha$.


Confidence intervals may be either exact or approximate. When the exact 
distribution of the test statistics used to construct a confidence interval is 
known, the coverage will be equal to the confidence level, and the interval will 
be exact. Otherwise, we have to be content with approximate confidence inter- 
vals, which may be based either on asymptotic theory or on the bootstrap. In 
the next section, we discuss both exact confidence intervals and approximate 
ones based on asymptotic theory. Then, in Section 5.3, we discuss bootstrap 
confidence intervals. 


Like a confidence interval, a $1 - \alpha$ confidence region for a set of k model
parameters, such as the components of a k-vector $\theta$, is a region in a k-dimensional
space (often, the region is the k-dimensional analog of an ellipse) constructed
in such a way that, for every point represented by the k-vector $\theta_0$ in the
confidence region, the joint hypothesis that $\theta = \theta_0$ is not rejected by the
appropriate member of a family of tests at level $\alpha$. Thus confidence regions
constructed in this way will cover the true values of the parameter vector
$100(1 - \alpha)\%$ of the time, either exactly or approximately. In Section 5.4, we
show how to construct confidence regions and explain the relationship between
confidence regions and confidence intervals.


In previous chapters, we assumed that the error terms in regression models 
are independently and identically distributed. This assumption yielded a sim- 
ple form for the covariance matrix of a vector of OLS parameter estimates, 
expression (3.28), and a simple way of estimating this matrix. In Section 5.5, 
we show that it is possible to estimate the covariance matrix of a vector of 
OLS estimates even when we abandon the assumption that the error terms are 
identically distributed. Finally, in Section 5.6, we discuss a simple and widely- 
used method for obtaining standard errors, covariance matrix estimates, and 
confidence intervals for nonlinear functions of estimated parameters. 


5.2 Exact and Asymptotic Confidence Intervals 


A confidence interval for some scalar parameter $\theta$ consists of all values
$\theta_0$ for which the hypothesis $\theta = \theta_0$ cannot be rejected at some
specified level $\alpha$.
Thus, as we will see in a moment, we can construct a confidence interval 
by “inverting” a test statistic. If the finite-sample distribution of the test 
statistic is known, we will obtain an exact confidence interval. If, as is more 
commonly the case, only the asymptotic distribution of the test statistic is 
known, we will obtain an asymptotic confidence interval, which may or may 
not be reasonably accurate in finite samples. Whenever a test statistic based 
on asymptotic theory has poor finite-sample properties, a confidence interval
based on that statistic will have poor coverage: In other words, the interval
will not cover the true parameter value with the specified probability. In such 
cases, it may well be worthwhile to seek other test statistics that will yield 
different confidence intervals with better coverage. 


To begin with, suppose that we wish to base a confidence interval for the
parameter $\theta$ on a family of test statistics that have a distribution or
asymptotic distribution like the $\chi^2$ or the $F$ distribution under their
respective nulls. Statistics of this type are always positive, and tests based
on them reject their null hypotheses when the statistics are sufficiently
large. Such tests are often equivalent to two-tailed tests based on statistics
distributed as standard normal or Student’s t. Let us denote the test statistic
for the hypothesis that $\theta = \theta_0$ by the random variable $\tau(y, \theta_0)$.
Here $y$ denotes the sample used to compute the particular realization of the
statistic. It is the random element in the statistic, since $\tau(\cdot)$ is just a
deterministic function of its arguments.


For each $\theta_0$, the test consists of comparing the realized $\tau(y, \theta_0)$
with the level $\alpha$ critical value of the distribution of the statistic under
the null. If we write the critical value as $c_\alpha$, then, for any $\theta_0$, we
have by the definition of $c_\alpha$ that

$$\Pr\nolimits_{\theta_0}\bigl(\tau(y, \theta_0) \le c_\alpha\bigr) = 1 - \alpha. \tag{5.01}$$

Here the subscript $\theta_0$ indicates that the probability is calculated under
the hypothesis that $\theta = \theta_0$. If $c_\alpha$ is a critical value for the
asymptotic distribution of $\tau(y, \theta_0)$, rather than for the exact
distribution, then (5.01) is only approximately true. For $\theta_0$ to belong to
the confidence interval obtained by inverting the family of test statistics
$\tau(y, \theta_0)$, it is necessary and sufficient that

$$\tau(y, \theta_0) \le c_\alpha. \tag{5.02}$$

Thus the limits of the confidence interval can be found by solving the equation

$$\tau(y, \theta) = c_\alpha \tag{5.03}$$

for $\theta$. This equation will normally have two solutions. One of these solutions
will be the upper limit, $\theta_u$, and the other will be the lower limit,
$\theta_l$, of the confidence interval that we are trying to construct.


If $c_\alpha$ is an exact critical value for the test statistic $\tau(y, \theta)$ at
level $\alpha$, then the confidence interval $[\theta_l, \theta_u]$ constructed in this
way will have coverage $1 - \alpha$, as desired. To see this, observe first that, if
we can find an exact critical value $c_\alpha$, the random function
$\tau(y, \theta_0)$ must be pivotal for the model M under consideration. In saying
this, we are implicitly generalizing the definition of a pivotal quantity (see
Section 4.6) to include random variables that may depend on the model
parameters. A random function $\tau(y, \theta)$ is said to be pivotal for M if, when
it is evaluated at the true value $\theta_0$ corresponding to some DGP in M, the
result is a random variable whose distribution does not depend on what that DGP
is. Pivotal functions of more than one model parameter are defined in exactly
the same way. The function is merely asymptotically pivotal if only the
asymptotic distribution is invariant to the choice of DGP.


Suppose that $\tau(y, \theta)$ is an exact pivot. Then, for every DGP in the
model M, (5.01) holds exactly. Since $\theta_0$ belongs to the confidence interval
if and only if (5.02) holds, this means that the confidence interval contains
the true parameter value $\theta_0$ with probability exactly equal to $1 - \alpha$,
whatever the true parameter value may be.


Even if it is not an exact pivot, the function $\tau(y, \theta_0)$ must be
asymptotically pivotal, since otherwise the critical value $c_\alpha$ would depend
asymptotically on the unknown DGP in M, and we could not construct a confidence
interval with the correct coverage, even asymptotically. Of course, if $c_\alpha$
is only approximate, then the coverage of the interval will differ from
$1 - \alpha$ to a greater or lesser extent, in a manner that, in general, depends on
the unknown true DGP.


Quantiles 


When we speak of critical values, we are implicitly making use of the concept 
of a quantile of the distribution that the test statistic follows under the null 
hypothesis. If $F(x)$ denotes the CDF of a random variable $X$, and if the PDF
$f(x) = F'(x)$ exists and is strictly positive on the entire range of possible
values for $X$, then $q_\alpha$, the $\alpha$ quantile of $F$, for $0 < \alpha < 1$,
satisfies the equation $F(q_\alpha) = \alpha$. The assumption of a strictly positive
PDF means that $F$ is strictly increasing over its range. Therefore, the inverse
function $F^{-1}$ exists, and $q_\alpha = F^{-1}(\alpha)$. For this reason, $F^{-1}$ is
sometimes called the quantile function. If $F$ is not strictly increasing, or if
the PDF does not exist, which, as we saw in Section 1.2, is the case for a
discrete distribution, the $\alpha$ quantile does not necessarily exist, and is not
necessarily uniquely defined, for all values of $\alpha$.


The 0.5 quantile of a distribution is often called the median. For
$\alpha = 0.25$, 0.5, and 0.75, the corresponding quantiles are called quartiles;
for $\alpha = 0.2$, 0.4, 0.6, and 0.8, they are called quintiles; for
$\alpha = i/10$ with $i$ an integer between 1 and 9, they are called deciles; for
$\alpha = i/20$ with $1 \le i \le 19$, they are called vigintiles; and, for
$\alpha = i/100$ with $1 \le i \le 99$, they are called centiles. The quantile
function of the standard normal distribution is shown in Figure 5.1. All three
quartiles, the first and ninth deciles, and the .025 and .975 quantiles are
shown in the figure.
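
As a small illustration, which is not part of the original exposition, the
quantiles marked in Figure 5.1 can be computed with Python's standard library.
Here `NormalDist.inv_cdf` plays the role of the quantile function $F^{-1}$; the
variable names `std_normal`, `levels`, and `quantiles` are our own.

```python
from statistics import NormalDist

# Standard normal distribution; inv_cdf is its quantile function F^{-1}.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Levels marked in Figure 5.1: the .025 quantile, the first decile, the three
# quartiles, the ninth decile, and the .975 quantile.
levels = [0.025, 0.10, 0.25, 0.50, 0.75, 0.90, 0.975]
quantiles = {a: std_normal.inv_cdf(a) for a in levels}

for a, q in quantiles.items():
    # By definition of the alpha quantile, F(q_alpha) = alpha.
    assert abs(std_normal.cdf(q) - a) < 1e-9
    print(f"alpha = {a:5.3f}   q_alpha = {q:8.4f}")
```

Because the N(0,1) distribution is symmetric about the origin, the quantiles
for $\alpha$ and $1 - \alpha$ differ only in sign, and the median is zero.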


Asymptotic Confidence Intervals 


The discussion up to this point has deliberately been rather abstract, because
$\tau(y, \theta_0)$ can, in principle, be any sort of test statistic. To obtain more
concrete results, let us suppose that

$$\tau(y, \theta_0) \equiv \left(\frac{\hat\theta - \theta_0}{s_\theta}\right)^{2}, \tag{5.04}$$

where $\hat\theta$ is an estimate of $\theta$, and $s_\theta$ is the corresponding standard
error, that is, an estimate of the standard deviation of $\hat\theta$. Thus
$\tau(y, \theta_0)$ is the square of the t statistic for the null hypothesis that
$\theta = \theta_0$. If $\hat\theta$ were an OLS estimate of a regression coefficient,
then, under conditions that were discussed in Section 4.5, the test statistic
defined in (5.04) would be asymptotically distributed as $\chi^2(1)$ under the null
hypothesis. Therefore, the asymptotic critical value $c_\alpha$ would be the
$1 - \alpha$ quantile of the $\chi^2(1)$ distribution.

Figure 5.1 The quantile function of the standard normal distribution


For the test statistic (5.04), equation (5.03) becomes

$$\left(\frac{\hat\theta - \theta}{s_\theta}\right)^{2} = c_\alpha.$$

Taking the square root of both sides and multiplying by $s_\theta$ then gives

$$|\hat\theta - \theta| = s_\theta c_\alpha^{1/2}. \tag{5.05}$$

As expected, there are two solutions to equation (5.05). These are

$$\theta_l = \hat\theta - s_\theta c_\alpha^{1/2} \quad\text{and}\quad \theta_u = \hat\theta + s_\theta c_\alpha^{1/2},$$

and so the asymptotic $1 - \alpha$ confidence interval for $\theta$ is

$$\bigl[\hat\theta - s_\theta c_\alpha^{1/2},\ \hat\theta + s_\theta c_\alpha^{1/2}\bigr]. \tag{5.06}$$

This means that the interval consists of all values of $\theta$ between the lower
limit $\hat\theta - s_\theta c_\alpha^{1/2}$ and the upper limit
$\hat\theta + s_\theta c_\alpha^{1/2}$. For $\alpha = 0.05$, the $1 - \alpha$ quantile of the
$\chi^2(1)$ distribution is 3.8415, the square root of which is 1.9600. Thus the
confidence interval given by (5.06) becomes

$$\bigl[\hat\theta - 1.96\, s_\theta,\ \hat\theta + 1.96\, s_\theta\bigr]. \tag{5.07}$$

Figure 5.2 A symmetric confidence interval


This interval is shown in Figure 5.2, which illustrates the manner in which 
it is constructed. The value of the test statistic is on the vertical axis of the 
figure. The upper and lower limits of the interval occur at the values of 0 
where the test statistic (5.04) is equal to ca, which in this case is 3.8415. 
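
The inversion just described can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the estimate `theta_hat = 1.0`
and standard error `s_theta = 0.5` are hypothetical numbers, and the $1 - \alpha$
quantile of the $\chi^2(1)$ distribution is obtained as the square of the
$1 - \alpha/2$ quantile of N(0,1), which avoids the need for a $\chi^2$ routine.

```python
from statistics import NormalDist

def asymptotic_interval(theta_hat, s_theta, alpha=0.05):
    """Solve ((theta_hat - theta) / s_theta)**2 = c_alpha for theta, as in (5.05)."""
    # 1 - alpha quantile of chi-squared(1), computed as the square of the
    # 1 - alpha/2 quantile of the standard normal distribution.
    c_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    half_width = s_theta * c_alpha ** 0.5
    return theta_hat - half_width, theta_hat + half_width, c_alpha

# Hypothetical estimate and standard error, for illustration only.
theta_hat, s_theta = 1.0, 0.5
theta_l, theta_u, c_alpha = asymptotic_interval(theta_hat, s_theta)

def tau(theta):
    """Squared t statistic (5.04) evaluated at a candidate value theta."""
    return ((theta_hat - theta) / s_theta) ** 2

print(f"c_alpha = {c_alpha:.4f}")
print(f"interval = [{theta_l:.4f}, {theta_u:.4f}]")
print(f"tau at the limits: {tau(theta_l):.4f}, {tau(theta_u):.4f}")
```

At both limits the statistic equals the critical value, and for any value of
$\theta$ strictly between them it is smaller, which is exactly what Figure 5.2
depicts.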


We would have obtained the same confidence interval as (5.06) if we had
started with the asymptotic t statistic $(\hat\theta - \theta_0)/s_\theta$ and used the
N(0,1) distribution to perform a two-tailed test. For such a test, there are
two critical values, one the negative of the other, because the N(0,1)
distribution is symmetric about the origin. These critical values are defined
in terms of the quantiles of that distribution. The relevant ones are now the
$\alpha/2$ and the $1 - (\alpha/2)$ quantiles, since we wish to have the same
probability mass in each tail of the distribution. It is conventional to denote
these quantiles of the standard normal distribution by $z_{\alpha/2}$ and
$z_{1-(\alpha/2)}$, respectively. Note that $z_{\alpha/2}$ is negative, since
$\alpha/2 < 1/2$, and the median of the N(0,1) distribution is 0. By symmetry, it is
the negative of $z_{1-(\alpha/2)}$. Equation (5.03), which has two solutions for a
$\chi^2$ test, is replaced by two equations, each with just one solution, as
follows:

$$\tau(y, \theta) = \pm c.$$

Here $\tau(y, \theta)$ denotes the (signed) t statistic rather than the $\chi^2(1)$
statistic used in (5.03), and the positive number $c$ can be defined either as
$z_{1-(\alpha/2)}$ or as $-z_{\alpha/2}$. The resulting confidence interval
$[\theta_l, \theta_u]$ can thus be written in two different ways:

$$\bigl[\hat\theta + s_\theta z_{\alpha/2},\ \hat\theta - s_\theta z_{\alpha/2}\bigr] \quad\text{and}\quad \bigl[\hat\theta - s_\theta z_{1-(\alpha/2)},\ \hat\theta + s_\theta z_{1-(\alpha/2)}\bigr]. \tag{5.08}$$

When $\alpha = .05$, we once again obtain the interval (5.07), since
$z_{.025} = -1.96$ and $z_{.975} = 1.96$.


Asymmetric Confidence Intervals 


The confidence interval (5.06), which is the same as the interval (5.08), is a
symmetric one, because $\theta_l$ is as far below $\hat\theta$ as $\theta_u$ is above it.
Although many
confidence intervals are symmetric, not all of them share this property. The 
symmetry of (5.06) is a consequence of the symmetry of the standard normal 
distribution and of the form of the test statistic (5.04). 


It is possible to construct confidence intervals based on two-tailed tests even
when the distribution of the test statistic is not symmetric. For a chosen
level $\alpha$, we wish to reject whenever the statistic is too far into either the
right-hand or the left-hand tail of the distribution. Unfortunately, there are
many ways to interpret “too far” in this context. The simplest is probably
to define the rejection region in such a way that there is a probability mass
of $\alpha/2$ in each tail. This is called an equal-tailed confidence interval. Two
critical values are needed for each level, a lower one, $c_\alpha^-$, which will be
the $\alpha/2$ quantile of the distribution, and an upper one, $c_\alpha^+$, which will
be the $1 - (\alpha/2)$ quantile. A realized statistic $\hat\tau$ will lead to rejection
at level $\alpha$ if either $\hat\tau < c_\alpha^-$ or $\hat\tau > c_\alpha^+$. This will lead to
an asymmetric confidence interval. We will discuss such intervals, where the
critical values are obtained by bootstrapping, in the next section.


It is also possible to construct confidence intervals based on one-tailed tests. 
Such an interval will be open all the way out to infinity in one direction.
Suppose that, for each $\theta_0$, the null $\theta \le \theta_0$ is tested against the
alternative $\theta > \theta_0$. If the true parameter value is finite, we will never
want to reject the null for any $\theta_0$ that substantially exceeds the true
value. Consequently, the confidence interval will be open out to plus infinity.
Formally, the null is rejected only if the signed t statistic is algebraically
greater than the appropriate critical value. For the N(0,1) distribution, this
is $z_{1-\alpha}$ for level $\alpha$. The null $\theta \le \theta_0$ will not be rejected if
$\tau(y, \theta_0) \le z_{1-\alpha}$, that is, if $\hat\theta - \theta_0 \le s_\theta z_{1-\alpha}$.
The interval over which $\theta_0$ satisfies this inequality is just

$$\bigl[\hat\theta - s_\theta z_{1-\alpha},\ +\infty\bigr]. \tag{5.09}$$
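
A minimal numerical sketch of the one-sided interval (5.09) follows. The inputs
(an estimate of 1.0 with standard error 0.5) are hypothetical, and the function
name `one_sided_interval` is our own.

```python
import math
from statistics import NormalDist

def one_sided_interval(theta_hat, s_theta, alpha=0.05):
    """Interval (5.09): all theta_0 with (theta_hat - theta_0)/s_theta <= z_{1-alpha}."""
    z = NormalDist().inv_cdf(1.0 - alpha)      # upper-tail quantile z_{1-alpha}
    return theta_hat - s_theta * z, math.inf   # open out to plus infinity

# Hypothetical estimate and standard error, for illustration only.
theta_hat, s_theta = 1.0, 0.5
lower, upper = one_sided_interval(theta_hat, s_theta)

# For comparison, the lower limit of the two-sided interval (5.08).
two_sided_lower = theta_hat - s_theta * NormalDist().inv_cdf(0.975)

print(f"one-sided: [{lower:.4f}, {upper}]")
print(f"two-sided lower limit: {two_sided_lower:.4f}")
```

Because all of the probability mass $\alpha$ is placed in one tail, the finite
limit of the one-sided interval lies closer to the estimate than the
corresponding limit of the two-sided interval at the same level.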


P Values and Asymmetric Distributions 


The above discussion of asymmetric confidence intervals raises the question of 
how to calculate P values for two-tailed tests based on statistics with asym- 
metric distributions. This is a little tricky, but it will turn out to be useful 
when we discuss bootstrap confidence intervals in the next section.

If the P value is defined, as usual, as the smallest level for which the test
rejects, then, if we denote by $F$ the CDF used to calculate critical values or
P values, the P value associated with a statistic $\hat\tau$ should be $2F(\hat\tau)$
if $\hat\tau$ is in the lower tail, and $2(1 - F(\hat\tau))$ if it is in the upper
tail. This can be seen by the same arguments, based on Figure 4.2, that were
used for symmetric two-tailed tests. A slight problem arises as to the point of
separation between the left and right sides of the distribution. However, it is
easy to see that only one of the two possible P values is less than 1, unless
$F(\hat\tau)$ is exactly equal to 0.5, in which case both are equal to 1, and there
is no ambiguity. In complete generality, then, we have that the P value is

$$p(\hat\tau) = 2\min\bigl(F(\hat\tau),\ 1 - F(\hat\tau)\bigr). \tag{5.10}$$

Thus the point that separates the left and right sides of the distribution is
the median, $q_{.50}$, since $F(q_{.50}) = .50$ by definition. Any $\hat\tau$ greater
than the median is in the right-hand tail of the distribution, and any $\hat\tau$
less than the median is in the left-hand tail.
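
Formula (5.10) is easy to try out on a distribution that is visibly
asymmetric. The sketch below uses the unit exponential distribution purely as
an illustrative choice; it is not a distribution discussed in the text, and the
function names are our own.

```python
import math

def two_tailed_p_value(tau_hat, cdf):
    """Equal-tail P value (5.10): 2 * min(F(tau_hat), 1 - F(tau_hat))."""
    f = cdf(tau_hat)
    return 2.0 * min(f, 1.0 - f)

def exponential_cdf(x):
    """CDF of the unit exponential distribution, an asymmetric example."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

median = math.log(2.0)   # F(median) = 0.5 for the unit exponential

p_at_median = two_tailed_p_value(median, exponential_cdf)
p_lower = two_tailed_p_value(0.01, exponential_cdf)   # left of the median
p_upper = two_tailed_p_value(5.0, exponential_cdf)    # right of the median

print(p_at_median, p_lower, p_upper)
```

At the median the P value is 1, and it falls toward 0 as the statistic moves
into either tail, even though the two tails have very different shapes.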


Exact Confidence Intervals for Regression Coefficients 

In Section 4.4, we saw that, for the classical normal linear model, exact tests 
of linear restrictions on the parameters of the regression function are available, 
based on the t and F distributions. This implies that we can construct exact 
confidence intervals. Consider the classical normal linear model (4.21), in 
which the parameter vector $\beta$ has been partitioned as
$[\beta_1 \;\vdots\; \beta_2]$, where $\beta_1$ is a $(k - 1)$-vector and $\beta_2$ is a
scalar. The t statistic for the hypothesis that $\beta_2 = \beta_{20}$ for any
particular value $\beta_{20}$ can be written as

$$\frac{\hat\beta_2 - \beta_{20}}{s_2}, \tag{5.11}$$

where $s_2$ is the usual OLS standard error for $\hat\beta_2$.


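Any DGP in the model (4.21) satisfies $\beta_2 = \beta_{20}$ for some $\beta_{20}$. With
the correct value of $\beta_{20}$, the t statistic (5.11) has the $t(n - k)$
distribution, and so

$$\Pr\left(t_{\alpha/2} \le \frac{\hat\beta_2 - \beta_{20}}{s_2} \le t_{1-(\alpha/2)}\right) = 1 - \alpha, \tag{5.12}$$

where $t_{\alpha/2}$ and $t_{1-(\alpha/2)}$ denote the $\alpha/2$ and $1 - (\alpha/2)$ quantiles
of the $t(n - k)$ distribution. We can use equation (5.12) to find a $1 - \alpha$
confidence interval for $\beta_2$. The left-hand side of the equation is equal to

$$\begin{aligned}
&\Pr\bigl(s_2 t_{\alpha/2} \le \hat\beta_2 - \beta_{20} \le s_2 t_{1-(\alpha/2)}\bigr) \\
&\quad= \Pr\bigl(-s_2 t_{\alpha/2} \ge \beta_{20} - \hat\beta_2 \ge -s_2 t_{1-(\alpha/2)}\bigr) \\
&\quad= \Pr\bigl(\hat\beta_2 - s_2 t_{\alpha/2} \ge \beta_{20} \ge \hat\beta_2 - s_2 t_{1-(\alpha/2)}\bigr).
\end{aligned}$$

Therefore, the confidence interval we are seeking is

$$\bigl[\hat\beta_2 - s_2 t_{1-(\alpha/2)},\ \hat\beta_2 - s_2 t_{\alpha/2}\bigr]. \tag{5.13}$$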


At first glance, this interval may look a bit odd, because the upper limit is
obtained by subtracting something from $\hat\beta_2$. What is subtracted is
negative, however, because $t_{\alpha/2} < 0$, since it is in the lower tail of the
t distribution. Thus the interval does in fact contain the point estimate
$\hat\beta_2$.


It may still seem strange that the lower and upper limits of (5.13) depend, 
respectively, on the upper-tail and lower-tail quantiles of the t(n — k) distri- 
bution. This actually makes perfect sense, however, as can be seen by looking 
at the infinite confidence interval (5.09) based on a one-tailed test. There, 
since the null is that $\theta \le \theta_0$, the confidence interval must be open out to $+\infty$,
and so only the lower limit of the confidence interval is finite. But the null is 
rejected when the test statistic is in the upper tail of its distribution, and so 
it must be the upper-tail quantile that determines the only finite limit of the 
confidence interval, namely, the lower limit. Readers are strongly advised to 
take some time to think this point through, since most people find it strongly 
counter-intuitive when they first encounter it, and they can accept it only 
after a period of reflection. 


In the case of (5.13), it is easy to rewrite the confidence interval so that
it depends only on the positive, upper-tail, quantile, $t_{1-(\alpha/2)}$. Because
the Student’s t distribution is symmetric, the interval (5.13) is the same as
the interval

$$\bigl[\hat\beta_2 - s_2 t_{1-(\alpha/2)},\ \hat\beta_2 + s_2 t_{1-(\alpha/2)}\bigr]; \tag{5.14}$$

compare the two ways of writing the confidence interval (5.08). For
concreteness, suppose that $\alpha = .05$ and $n - k = 32$. In this special case,
$t_{1-(\alpha/2)} = t_{.975} = 2.037$. Thus the .95 confidence interval based on
(5.14) extends from 2.037 standard errors below $\hat\beta_2$ to 2.037 standard
errors above it. This interval is slightly wider than the interval (5.07),
which is based on asymptotic theory.
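
The comparison can be made concrete with a short sketch. The critical value
2.037 is the one quoted above for 32 degrees of freedom; the estimate
`beta_hat = 0.8` and standard error `s2 = 0.25` are hypothetical numbers chosen
only for illustration, and the function names are our own.

```python
def interval_5_13(beta_hat, s, t_lower, t_upper):
    """Interval (5.13): [beta_hat - s * t_{1-a/2}, beta_hat - s * t_{a/2}]."""
    return beta_hat - s * t_upper, beta_hat - s * t_lower

def interval_5_14(beta_hat, s, t_upper):
    """Interval (5.14), which uses only the positive upper-tail quantile."""
    return beta_hat - s * t_upper, beta_hat + s * t_upper

T_975_32DF = 2.037   # .975 quantile of t(32), as quoted in the text
Z_975 = 1.96         # .975 quantile of N(0,1), as used in (5.07)

# Hypothetical estimate and standard error, for illustration only.
beta_hat, s2 = 0.8, 0.25

# By symmetry of the t distribution, t_{a/2} = -t_{1-a/2}.
exact_a = interval_5_13(beta_hat, s2, -T_975_32DF, T_975_32DF)
exact_b = interval_5_14(beta_hat, s2, T_975_32DF)
asymptotic = interval_5_14(beta_hat, s2, Z_975)

print("exact (5.13):     ", exact_a)
print("exact (5.14):     ", exact_b)
print("asymptotic (5.07):", asymptotic)
```

The two exact forms coincide, and the exact interval is wider than the
asymptotic one by $2 s_2 (2.037 - 1.96)$.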


We obtained the interval (5.14) by starting from the t statistic (5.11) and 
using the Student’s t distribution. As readers are asked to demonstrate in 
Exercise 5.2, we would have obtained precisely the same interval if we had 
started instead from the square of (5.11) and used the F distribution. 


5.3 Bootstrap Confidence Intervals 


When exact confidence intervals are not available, and they generally are not, 
asymptotic ones are normally used. However, just as asymptotic tests do 
not always perform well in finite samples, neither do asymptotic confidence 
intervals. Since bootstrap P values and tests based on them often outperform 
their asymptotic counterparts, it seems natural to base confidence intervals
on bootstrap tests when asymptotic intervals give poor coverage. There are
a great many varieties of bootstrap confidence intervals; for a comprehensive 
discussion, see Davison and Hinkley (1997). 


When we construct a bootstrap confidence interval, we wish to treat a fam- 
ily of tests, each corresponding to its own null hypothesis. Since, when we 
perform a bootstrap test, we must use a bootstrap DGP that satisfies the 
null hypothesis, it appears that we must use an infinite number of bootstrap 
DGPs if we are to consider the full family of tests, each with a different null. 
Fortunately, there is a clever trick that lets us avoid this difficulty completely. 


It is, of course, essential for a bootstrap test that the bootstrap DGP should 
satisfy the null hypothesis under test. However, when the distribution of the 
test statistic does not depend on precisely which null is being tested, the same 
bootstrap distribution can be used for a whole family of tests with different 
nulls. If a family of test statistics is defined in terms of a pivotal random
function $\tau(y, \theta_0)$, then, by definition, the distribution of this function
is independent of $\theta_0$. Thus we could choose any value of $\theta_0$ that the
model allows for the bootstrap DGP, and the distribution of the test statistic,
evaluated at $\theta_0$, would always be the same. The important thing is to make
sure that $\tau(\cdot)$ is evaluated at the same value of $\theta_0$ as the one used to
generate the bootstrap samples. Even if $\tau(\cdot)$ is only asymptotically pivotal,
the effect of the choice of $\theta_0$ on the distribution of the statistic should
be slight if the sample size is reasonably large.


Suppose that we wish to construct a bootstrap confidence interval based on
the t statistic $\hat\tau(\theta_0) \equiv \tau(y, \theta_0) = (\hat\theta - \theta_0)/s_\theta$.
The first step is to compute $\hat\theta$ and $s_\theta$ using the original data $y$.
Then we generate bootstrap samples using a DGP, which may be either parametric
or semiparametric, characterized by $\hat\theta$ and by any other relevant
estimates, such as the error variance, that may be needed. The resulting
bootstrap DGP is thus quite independent of $\theta_0$, but it does depend on the
estimate $\hat\theta$.


We can now generate $B$ bootstrap samples, $y_j^*$, $j = 1, \ldots, B$. For each
of these, we compute an estimate $\theta_j^*$ and its standard error $s_j^*$ in
exactly the same way that we computed $\hat\theta$ and $s_\theta$ from the original
data, and we then compute the bootstrap “t statistic”

$$t_j^* \equiv \tau(y_j^*, \hat\theta) = \frac{\theta_j^* - \hat\theta}{s_j^*}. \tag{5.15}$$

This is the statistic that tests the null hypothesis that $\theta = \hat\theta$,
because $\hat\theta$ is the true value of $\theta$ for the bootstrap DGP. If $\tau(\cdot)$
is an exact pivot, the change of null from $\theta_0$ to $\hat\theta$ makes no
difference. If $\tau(\cdot)$ is an asymptotic pivot, there should usually be only a
slight difference for values of $\theta_0$ close to $\hat\theta$.
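
To fix ideas, here is a minimal sketch of how the $t_j^*$ of (5.15) might be
generated in the simplest possible setting, where $\theta$ is the mean of an IID
sample and the bootstrap DGP resamples recentered residuals. The model, the
simulated data, the seeds, and the function names are all assumptions made for
illustration; they are not taken from the text.

```python
import random
import statistics

def estimate(sample):
    """Return (theta_hat, s_theta) for the mean of an IID sample."""
    n = len(sample)
    theta_hat = statistics.fmean(sample)
    s_theta = statistics.stdev(sample) / n ** 0.5
    return theta_hat, s_theta

def bootstrap_t_statistics(y, B=999, seed=12345):
    """Generate B bootstrap t statistics as in (5.15)."""
    rng = random.Random(seed)
    theta_hat, _ = estimate(y)
    residuals = [obs - theta_hat for obs in y]
    t_stars = []
    for _ in range(B):
        # Bootstrap DGP: theta_hat plus resampled residuals, so that the true
        # parameter value under the bootstrap DGP is theta_hat.
        y_star = [theta_hat + rng.choice(residuals) for _obs in y]
        theta_star, s_star = estimate(y_star)
        # The statistic is evaluated at theta_hat, not at theta_0.
        t_stars.append((theta_star - theta_hat) / s_star)
    return t_stars

# Hypothetical skewed data, simulated only so that the sketch is runnable.
data_rng = random.Random(1)
y = [data_rng.expovariate(1.0) for _ in range(30)]
t_stars = bootstrap_t_statistics(y)
print(len(t_stars), min(t_stars), max(t_stars))
```

Each bootstrap sample is treated exactly like the original data: the same
estimator and the same standard error formula are applied to it.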


The limits of the bootstrap confidence interval will depend on the quantiles of
the EDF of the $t_j^*$. We can choose to construct either a symmetric confidence
interval, by estimating a single critical value that applies to both tails, or
an asymmetric one, by estimating two different critical values. When the
distribution of the underlying test statistic $\tau(y, \theta)$ is not symmetric, the
latter interval should be more accurate. For this reason, and because we did
not discuss asymmetric intervals based on asymptotic tests, we now discuss
asymmetric bootstrap confidence intervals in some detail.


Asymmetric Bootstrap Confidence Intervals 


Let us denote by $F^*$ the EDF of the $B$ bootstrap statistics $t_j^*$. For given
$\theta_0$, the bootstrap P value is, from (5.10),

$$p^*\bigl(\hat\tau(\theta_0)\bigr) = 2\min\bigl(F^*(\hat\tau(\theta_0)),\ 1 - F^*(\hat\tau(\theta_0))\bigr). \tag{5.16}$$

If this P value is greater than or equal to $\alpha$, then $\theta_0$ belongs to the
$1 - \alpha$ confidence interval. If $F^*$ were the CDF of a continuous
distribution, we could express the confidence interval in terms of the
quantiles of this distribution, just as in (5.13). In the limit as
$B \to \infty$, the limiting distribution of the $t_j^*$, which we call the ideal
bootstrap distribution, is usually continuous, and its quantiles define the
ideal bootstrap confidence interval. However, since the distribution of the
$t_j^*$ is always discrete in practice, we must be a little more careful in our
reasoning.


Suppose, to begin with, that $\hat\tau(\theta_0)$ is on the left side of the
distribution. Then the bootstrap P value (5.16) is

$$2F^*\bigl(\hat\tau(\theta_0)\bigr) = \frac{2}{B}\sum_{j=1}^{B} I\bigl(t_j^* \le \hat\tau(\theta_0)\bigr) = \frac{2\,r(\theta_0)}{B},$$

where $r(\theta_0)$ is the number of bootstrap t statistics that are less than or
equal to $\hat\tau(\theta_0)$. Thus $\theta_0$ belongs to the $1 - \alpha$ confidence interval
if and only if $2r(\theta_0)/B \ge \alpha$, that is, if $r(\theta_0) \ge \alpha B/2$. Since
$r(\theta_0)$ is an integer, while $\alpha B/2$ is not an integer, in general, this
inequality is equivalent to $r(\theta_0) \ge r_{\alpha/2}$, where $r_{\alpha/2}$ is the
smallest integer not less than $\alpha B/2$.


First, observe that $r(\theta_0)$ cannot exceed $r_{\alpha/2}$ for $\theta_0$ sufficiently
large. Since $\hat\tau(\theta_0) = (\hat\theta - \theta_0)/s_\theta$, it follows that
$\hat\tau(\theta_0) \to -\infty$ as $\theta_0 \to \infty$. Accordingly, $r(\theta_0) \to 0$ as
$\theta_0 \to \infty$. Therefore, there exists a greatest value of $\theta_0$ for which
$r(\theta_0) \ge r_{\alpha/2}$. This value must be the upper limit of the $1 - \alpha$
bootstrap confidence interval.

Suppose we sort the $t_j^*$ from smallest to largest and denote by
$c^*_{\alpha/2}$ the entry in the sorted list indexed by $r_{\alpha/2}$. Then, if
$\hat\tau(\theta_0) = c^*_{\alpha/2}$, the number of the $t_j^*$ less than or equal to
$\hat\tau(\theta_0)$ is precisely $r_{\alpha/2}$. But if $\hat\tau(\theta_0)$ is smaller than
$c^*_{\alpha/2}$ by however small an amount, this number is strictly less than
$r_{\alpha/2}$. Thus $\theta_u$, the upper limit of the confidence interval, is defined
implicitly by $\hat\tau(\theta_u) = c^*_{\alpha/2}$. Explicitly, we have

$$\theta_u = \hat\theta - s_\theta c^*_{\alpha/2}.$$

As in the previous section, we see that the upper limit of the confidence
interval is determined by the lower tail of the bootstrap distribution.


If the statistic is an exact pivot, then the probability that the true value of
$\theta$ is greater than $\theta_u$ is exactly equal to $\alpha/2$ only if
$\alpha(B + 1)/2$ is an integer. This follows by exactly the same argument as the
one given in Section 4.6 for bootstrap P values. As an example, if $\alpha = .05$
and $B = 999$, we see that $\alpha(B + 1)/2 = 25$. In addition, since
$\alpha B/2 = 24.975$, we see that $r_{\alpha/2} = 25$. The value of $c^*_{\alpha/2}$ is
therefore the value of the 25th bootstrap t statistic when they are sorted in
ascending order.


In order to obtain the upper limit of the confidence interval, we began above
with the assumption that $\hat\tau(\theta_0)$ is on the left side of the distribution.
If we had begun by assuming that $\hat\tau(\theta_0)$ is on the right side of the
distribution, we would have found that the lower limit of the confidence
interval is

$$\theta_l = \hat\theta - s_\theta c^*_{1-(\alpha/2)},$$

where $c^*_{1-(\alpha/2)}$ is the entry indexed by $r_{1-(\alpha/2)}$ when the $t_j^*$ are
sorted in ascending order. For the example with $\alpha = .05$ and $B = 999$, this
is the 975th entry in the sorted list, since there are precisely 25 integers in
the range 975-999, just as there are in the range 1-25.


The asymmetric equal-tail bootstrap confidence interval can be written as

$$[\theta_l, \theta_u] = \bigl[\hat\theta - s_\theta c^*_{1-(\alpha/2)},\ \hat\theta - s_\theta c^*_{\alpha/2}\bigr]. \tag{5.17}$$

This interval bears a striking resemblance to the exact confidence
interval (5.13). Clearly, $c^*_{1-(\alpha/2)}$ and $c^*_{\alpha/2}$, which are
approximately the $1 - (\alpha/2)$ and $\alpha/2$ quantiles of the EDF of the bootstrap
tests, play the same roles as the $1 - (\alpha/2)$ and $\alpha/2$ quantiles of the exact
Student’s t distribution.
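
The construction of (5.17) from a list of bootstrap t statistics can be
sketched as follows. The rank $r_{\alpha/2}$ is computed as in the text; the
upper rank is taken to be $B + 1 - r_{\alpha/2}$, which is our reading of the
example above (it gives 975 when $B = 999$ and $\alpha = .05$) rather than a
formula stated in the text. The artificial $t_j^*$ values, the estimate, and the
standard error are hypothetical.

```python
import math

def tail_ranks(B, alpha=0.05):
    """Return (r_lo, r_hi), the 1-based ranks of the two critical values."""
    r_lo = math.ceil(alpha * B / 2.0)   # smallest integer not less than alpha*B/2
    r_hi = B + 1 - r_lo                 # assumed mirror-image rank in the upper tail
    return r_lo, r_hi

def studentized_bootstrap_interval(theta_hat, s_theta, t_stars, alpha=0.05):
    """Asymmetric equal-tail interval (5.17) from bootstrap t statistics."""
    ordered = sorted(t_stars)
    r_lo, r_hi = tail_ranks(len(ordered), alpha)
    c_lo = ordered[r_lo - 1]            # c*_{alpha/2}
    c_hi = ordered[r_hi - 1]            # c*_{1-(alpha/2)}
    # The upper-tail critical value determines the lower limit, and vice versa.
    return theta_hat - s_theta * c_hi, theta_hat - s_theta * c_lo

# Artificial, deliberately right-skewed bootstrap statistics (B = 999).
t_stars = [(j - 500) / 100.0 if j <= 500 else 2.0 * (j - 500) / 100.0
           for j in range(1, 1000)]

r_lo, r_hi = tail_ranks(len(t_stars))
theta_l, theta_u = studentized_bootstrap_interval(1.0, 0.5, t_stars)
print(r_lo, r_hi, theta_l, theta_u)
```

With these artificial values the upper tail of the $t_j^*$ is twice as long as
the lower tail, and so the lower limit lies twice as far from the estimate as
the upper limit does.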


Because the Student’s t distribution is symmetric, the confidence interval 
(5.13) is symmetric. In contrast, the interval (5.17) will almost never be sym- 
metric. Even if the distribution of the underlying test statistic happened to be 
symmetric, the bootstrap distribution based on finite B would almost never 
be. It is, of course, possible to construct a symmetric bootstrap confidence 
interval. We just need to invert a test for which the P value is not (5.10), 
but rather something like (4.07), which is based on the absolute value, or, 
equivalently, the square, of the t statistic. See Exercise 5.7. 


The bootstrap confidence interval (5.17) is called a studentized bootstrap 
confidence interval. The name comes from the fact that a statistic is said to 
be studentized when it is the ratio of a random variable to its standard error, 
as is the ordinary t statistic. This type of confidence interval is also sometimes 
called a percentile-t or bootstrap-t confidence interval. Studentized bootstrap
confidence intervals have good theoretical properties, and, as we have seen,
they are quite easy to construct. If the assumptions of the classical normal
linear model are violated and the empirical distribution of the $t_j^*$ provides a
better approximation to the actual distribution of the t statistic than does the
Student’s t distribution, then the studentized bootstrap confidence interval 
should be more accurate than the usual interval based on asymptotic theory. 


As we remarked above, there are a great many ways to compute bootstrap 
confidence intervals, and there is a good deal of controversy about the rel- 
ative merits of different approaches. For an introduction to the voluminous 
literature, see DiCiccio and Efron (1996) and the associated discussion. Some 
of the approaches in the literature appear to be obsolete, mere relics of the 
way in which ideas about the bootstrap were developed, and others are too 
complicated to explain here. Even if we limit our attention to studentized 
bootstrap intervals, there will often be several ways to proceed. Different 
methods of estimating standard errors inevitably lead to different confidence 
intervals, as do different ways of parametrizing a model. Thus, in practice, 
there will frequently be quite a number of reasonable ways to construct stu- 
dentized bootstrap confidence intervals. 


Note that specifying the bootstrap DGP is not at all trivial if the error terms 
are not assumed to be IID. In fact, this topic is quite advanced and has 
been the subject of much research: See Li and Maddala (1996) and Davison 
and Hinkley (1997), among others. Later in the book, we will discuss a few 
techniques that can be used with particular models. 


Theoretical results discussed in Hall (1992) and Davison and Hinkley (1997) 
suggest that studentized bootstrap confidence intervals will generally work 
better than intervals based on asymptotic theory. However, their coverage 
can be quite unsatisfactory in finite samples if the quantity
$(\hat\theta - \theta)/s_\theta$ is far from being pivotal, as can happen if the
distributions of either $\hat\theta$ or $s_\theta$ depend strongly on the true unknown
value of $\theta$ or on any other parameters of the model. When this is the case,
the standard errors will often fluctuate
wildly among the bootstrap samples. Of course, the coverage of asymptotic 
confidence intervals will generally also be unsatisfactory in such cases. 


5.4 Confidence Regions 


When we are interested in making inferences about the values of two or more 
parameters, it can be quite misleading to look at the confidence intervals 
for each of the parameters individually. By using confidence intervals, we are 
implicitly basing our inferences on the marginal distributions of the parameter 
estimates. However, if the estimates are not independent, the product of the 
marginal distributions may be very different from the joint distribution. In 
such cases, it makes sense to construct a confidence region. 


The confidence intervals we have discussed are all obtained by inverting t tests,
whether exact, asymptotic, or bootstrap, based on families of statistics of the
form $(\hat\theta - \theta_0)/s_\theta$. If we wish instead to construct a confidence
region, we must invert joint tests for several parameters. These will usually
be tests based on statistics that follow the $F$ or $\chi^2$ distributions, at
least asymptotically.


A t statistic depends explicitly on a parameter estimate and its standard error. 
Similarly, many tests for several parameters depend on a vector of parameter 
estimates and an estimate of their covariance matrix. Even many statistics 
that appear not to do so, such as F statistics, actually do so implicitly, as we 
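will see shortly. Suppose that we have a $k$-vector of parameter estimates
$\hat\theta$, of which the covariance matrix $\mathrm{Var}(\hat\theta)$ can be estimated by
$\widehat{\mathrm{Var}}(\hat\theta)$. Then, in many circumstances, the statistic

$$(\hat\theta - \theta_0)^\top \bigl(\widehat{\mathrm{Var}}(\hat\theta)\bigr)^{-1} (\hat\theta - \theta_0) \tag{5.18}$$

can be used to test the joint null hypothesis that $\theta = \theta_0$.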


The asymptotic distribution of (5.18) can be found by using Theorem 4.1. It
tells us that, if a $k$-vector $x$ is distributed as $N(0, \Omega)$, then the
quadratic form $x^\top \Omega^{-1} x$ is distributed as $\chi^2(k)$. In order to use this
result to show that the statistic (5.18) is asymptotically distributed as
$\chi^2(k)$ under the null hypothesis, we must study a little more asymptotic
theory.
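
Before turning to that theory, a small numerical sketch may help to show why
joint and individual inferences can differ. It evaluates the quadratic form
(5.18) for $k = 2$ and compares it with the .95 quantile of the $\chi^2(2)$
distribution, which has the closed form $-2\log(1 - p)$ because the $\chi^2(2)$
CDF is $1 - \exp(-x/2)$. The estimates and the covariance matrix are
hypothetical numbers with a correlation of 0.75, and the function names are
our own.

```python
import math

def wald_statistic(theta_hat, theta_0, var_hat):
    """Quadratic form (5.18) for k = 2; var_hat is a 2x2 nested list."""
    d1 = theta_hat[0] - theta_0[0]
    d2 = theta_hat[1] - theta_0[1]
    (v11, v12), (v21, v22) = var_hat
    det = v11 * v22 - v12 * v21
    # The inverse of a 2x2 matrix is (1/det) * [[v22, -v12], [-v21, v11]].
    return (d1 * (v22 * d1 - v12 * d2) + d2 * (-v21 * d1 + v11 * d2)) / det

def chi2_2_quantile(p):
    """p quantile of chi-squared(2), whose CDF is 1 - exp(-x/2)."""
    return -2.0 * math.log(1.0 - p)

# Hypothetical estimates: standard errors 0.2 and 0.2, correlation 0.75.
theta_hat = (1.0, 2.0)
var_hat = [[0.04, 0.03], [0.03, 0.04]]
critical_value = chi2_2_quantile(0.95)

# Both points are 1.5 standard errors from each estimate, so each coordinate
# lies inside its individual .95 asymptotic confidence interval.
w_along = wald_statistic(theta_hat, (1.3, 2.3), var_hat)    # along the correlation
w_against = wald_statistic(theta_hat, (1.3, 1.7), var_hat)  # against the correlation

print(critical_value, w_along, w_against)
```

The first point is not rejected by the joint test, but the second one is, even
though neither of its coordinates would be rejected individually.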


Asymptotic Normality and Root-n Consistency 


Although the notion of asymptotic normality is very general, for now we will 
introduce it for linear regression models only. Suppose, as in Section 4.5, that 
the data were generated by the DGP 


$$y = X\beta_0 + u, \quad u \sim \mathrm{IID}(0, \sigma_0^2 I), \tag{5.19}$$

given in (4.47). We have seen that the random vector $v = n^{-1/2}X^\top u$ defined
in (4.53) follows the normal distribution asymptotically, with mean vector 0
and covariance matrix $\sigma_0^2 S_{X^\top X}$, where $S_{X^\top X}$ is the plim of
$n^{-1}X^\top X$ as the sample size $n$ tends to infinity.


Consider now the estimation error of the vector of OLS estimates. For the
DGP (5.19), it is

$$ \hat\beta - \beta_0 = (X^\top X)^{-1}X^\top u. \tag{5.20} $$


As we saw in Section 3.3, $\hat\beta$ will be consistent under fairly weak conditions.
If it is, expression (5.20) tends to a limit of 0 as the sample size $n \to \infty$.
Therefore, its limiting covariance matrix is a zero matrix. Thus it would
appear that asymptotic theory has nothing to say about limiting variances for
consistent estimators. However, this is easily corrected by the usual device of
introducing a few well-chosen powers of $n$. If we rewrite (5.20) as


$$ n^{1/2}(\hat\beta - \beta_0) = (n^{-1}X^\top X)^{-1}\, n^{-1/2}X^\top u, $$


then the first factor on the right-hand side tends to $S_{X^\top X}^{-1}$ as $n \to \infty$, and
the second factor, which is just $v$, tends to a random vector distributed as
$N(0, \sigma_0^2 S_{X^\top X})$. Because $S_{X^\top X}$ is deterministic, we find that, asymptotically,


$$ \mathrm{Var}\bigl(n^{1/2}(\hat\beta - \beta_0)\bigr) = S_{X^\top X}^{-1}\,\sigma_0^2 S_{X^\top X}\, S_{X^\top X}^{-1} = \sigma_0^2 S_{X^\top X}^{-1}. $$


Moreover, since the vector $n^{1/2}(\hat\beta - \beta_0)$ is, asymptotically, just a deterministic
linear combination of the components of the multivariate normal random
vector $v$, we conclude that


$$ n^{1/2}(\hat\beta - \beta_0) \overset{a}{\sim} N(0, \sigma_0^2 S_{X^\top X}^{-1}). \tag{5.21} $$


Thus, under the fairly weak conditions we used in Section 4.5, we see that the
vector $\hat\beta$ is asymptotically normal, or exhibits asymptotic normality.
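The result (5.21) can be illustrated by a small Monte Carlo experiment. The sketch below is not from the text: it assumes a single standard normal regressor with no constant, so that $S_{X^\top X} = 1$, and uniform errors with variance $\sigma_0^2 = 1$, so that the variance of $n^{1/2}(\hat\beta - \beta_0)$ should be close to 1 even though the errors are not normal. The sample size, number of replications, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(12345)
n, reps, beta_0 = 200, 4000, 1.5

# One regressor, no constant; x_t ~ N(0, 1), so plim n^{-1} X'X = S = 1.
X = rng.standard_normal((reps, n))
# Non-normal IID errors: uniform on [-sqrt(3), sqrt(3)] has mean 0, variance 1.
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n))
y = beta_0 * X + u

# OLS estimate for each replication (each row is one sample of size n).
beta_hat = (X * y).sum(axis=1) / (X * X).sum(axis=1)
err = beta_hat - beta_0            # estimation error, as in (5.20)
scaled = np.sqrt(n) * err          # n^{1/2} (beta_hat - beta_0)

mc_var = scaled.var()              # should be near sigma_0^2 / S = 1
raw_var = err.var()                # shrinks toward 0 like 1/n
```

The unscaled estimation error has a variance of order $1/n$, which vanishes in the limit, while the scaled error has a variance that stabilizes near $\sigma_0^2 S_{X^\top X}^{-1}$.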


The result (5.21) tells us that the asymptotic covariance matrix of the vector
$n^{1/2}(\hat\beta - \beta_0)$ is the limit of $\sigma_0^2(n^{-1}X^\top X)^{-1}$ as $n \to \infty$. In practice, we divide
this by $n$ and use $s^2(X^\top X)^{-1}$ to estimate $\mathrm{Var}(\hat\beta)$, where $s^2$ is the usual
OLS estimate of the error variance; recall (3.49). However, it is important
to remember that, whenever $n^{-1}X^\top X$ tends to $S_{X^\top X}$ as $n \to \infty$, the matrix
$(X^\top X)^{-1}$, without the factor of $n$, simply tends to a zero matrix. As we saw a
moment ago, this is just a consequence of the fact that $\hat\beta$ is consistent. Thus,
although it would be convenient if we could dispense with powers of n when 
working out asymptotic approximations to covariance matrices, it would be 
mathematically incorrect and very risky to do so. 


The result (5.21) also gives us the rate of convergence of $\hat\beta$ to its probability
limit of $\beta_0$. Since multiplying the estimation error by $n^{1/2}$ gives rise to an
expression of zero mean and finite covariance matrix, it follows that the estimation
error itself tends to zero at the same rate as $n^{-1/2}$. This property is
expressed by saying that the estimator $\hat\beta$ is root-n consistent.


Quite generally, let $\hat\theta$ be a root-n consistent, asymptotically normal, estimator
of a parameter vector $\theta$. Any estimator of the covariance matrix of $\hat\theta$ must
tend to zero as $n \to \infty$. Let $\theta_0$ denote the true value of $\theta$, and let $V$ denote
the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$. Then an estimator $\widehat{\mathrm{Var}}(\hat\theta)$ is
said to be a consistent estimator of the covariance matrix of $\hat\theta$ if

$$ \operatorname*{plim}_{n\to\infty} \bigl(n\,\widehat{\mathrm{Var}}(\hat\theta)\bigr) = V. \tag{5.22} $$


We are finally in a position to justify the use of (5.18) as a statistic distributed 
as $\chi^2(k)$ under the null hypothesis. If $\hat\theta$ is root-n consistent and asymptotically
normal, and if $\widehat{\mathrm{Var}}(\hat\theta)$ is a consistent estimator of the variance of $\hat\theta$, then we
can write (5.18) as


$$ n^{1/2}(\hat\theta - \theta_0)^\top \bigl(n\,\widehat{\mathrm{Var}}(\hat\theta)\bigr)^{-1} n^{1/2}(\hat\theta - \theta_0). \tag{5.23} $$


Since $n^{1/2}(\hat\theta - \theta_0)$ is asymptotically normal under the null, with mean zero,
and since the middle factor above tends to the inverse of its limiting covariance
matrix, expression (5.23) is precisely in the form $x^\top \Omega^{-1} x$ of Theorem 4.1, and
so (5.18) is asymptotically distributed under the null as $\chi^2(k)$.


Exact Confidence Regions for Regression Parameters


Suppose that we want to construct a confidence region for the elements of the 
vector $\beta_2$ in the classical normal linear model (4.28), which we rewrite here
for ease of exposition:


$$ y = X_1\beta_1 + X_2\beta_2 + u, \qquad u \sim N(0, \sigma^2 I), \tag{5.24} $$


where $\beta_1$ and $\beta_2$ are a $k_1$-vector and a $k_2$-vector, respectively. The $F$ statistic
that can be used to test the hypothesis that $\beta_2 = 0$ is given in (4.33). If we
wish instead to test $\beta_2 = \beta_{20}$, then we can write (5.24) as


$$ y - X_2\beta_{20} = X_1\gamma_1 + X_2\gamma_2 + u, \qquad u \sim N(0, \sigma^2 I), \tag{5.25} $$


and test $\gamma_2 = 0$. It is not hard to show that the $F$ statistic for this hypothesis
takes the form


$$ \frac{(\hat\beta_2 - \beta_{20})^\top X_2^\top M_1 X_2 (\hat\beta_2 - \beta_{20})/k_2}{y^\top M_X y/(n - k)}, \tag{5.26} $$


where $k = k_1 + k_2$; see Exercise 5.8. When multiplied by $k_2$, this $F$ statistic
is in the form of (5.18). For the purposes of inference on $\beta_2$, regression (5.24)
is, by the FWL Theorem, equivalent to the regression


$$ M_1 y = M_1 X_2\beta_2 + M_1 u. $$


Thus $\mathrm{Var}(\hat\beta_2)$ is equal to $\sigma^2(X_2^\top M_1 X_2)^{-1}$. Since the denominator of (5.26) is
just $s^2$, the OLS estimate of the error variance from running regression (5.24),
$k_2$ times the $F$ statistic (5.26) can be written in the form of (5.18), with


$$ \widehat{\mathrm{Var}}(\hat\beta_2) = s^2 (X_2^\top M_1 X_2)^{-1} $$


providing a consistent estimator of the variance of $\hat\beta_2$; compare (3.50).


Under the assumptions of the classical normal linear model, the F statistic 
(5.26) follows the $F(k_2, n - k)$ distribution when the null hypothesis is true.
Therefore, we can use it to construct an exact confidence region. If $c_\alpha$ denotes
the $1 - \alpha$ quantile of the $F(k_2, n - k)$ distribution, then the $1 - \alpha$ confidence
region is the set of all $\beta_{20}$ for which


$$ (\hat\beta_2 - \beta_{20})^\top X_2^\top M_1 X_2 (\hat\beta_2 - \beta_{20}) \le c_\alpha k_2 s^2. \tag{5.27} $$


Since the left-hand side of this inequality is quadratic in $\beta_{20}$, the confidence
region is, for $k_2 = 2$, the interior of an ellipse and, for $k_2 > 2$, the interior of
a $k_2$-dimensional ellipsoid.
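To make the inequality (5.27) concrete, here is a hedged Python sketch for the case $k_2 = 2$. It relies on the fact that the $F(2, m)$ distribution has CDF $1 - (1 + 2x/m)^{-m/2}$, so its $1 - \alpha$ quantile is available in closed form; this shortcut is valid only for two numerator degrees of freedom. The simulated design, the function names `f_quantile_2` and `in_exact_region`, and the seed are all assumptions made for illustration.

```python
import numpy as np

def f_quantile_2(alpha, m):
    """1 - alpha quantile of F(2, m); closed form valid only for 2 numerator d.f."""
    return 0.5 * m * (alpha ** (-2.0 / m) - 1.0)

def in_exact_region(beta2_0, beta2_hat, A, s2, n, k, alpha=0.05):
    """Check inequality (5.27) for k2 = 2, where A = X2' M1 X2."""
    d = np.asarray(beta2_hat, dtype=float) - np.asarray(beta2_0, dtype=float)
    return bool(d @ A @ d <= f_quantile_2(alpha, n - k) * 2 * s2)

# Simulated data (assumed design): a constant plus two correlated regressors.
rng = np.random.default_rng(0)
n = 50
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
X1 = np.ones((n, 1))
X2 = np.column_stack([x1, x2])
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
k = X.shape[1]
s2 = resid @ resid / (n - k)

# M1 X2: residuals from regressing the columns of X2 on X1 (here, demeaning).
M1X2 = X2 - X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]
A = M1X2.T @ M1X2            # equals X2' M1 X2 because M1 is idempotent

inside_center = in_exact_region(beta_hat[1:], beta_hat[1:], A, s2, n, k)
inside_far = in_exact_region(beta_hat[1:] + 10.0, beta_hat[1:], A, s2, n, k)
```

The point estimates themselves always lie inside the region, since the left-hand side of (5.27) is then zero, while a point far from the estimates does not.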

Figure 5.3 Confidence ellipses and confidence intervals (the plot shows the confidence ellipse for $(\beta_1, \beta_2)$)


Confidence Ellipses and Confidence Intervals 


Figure 5.3 illustrates what a confidence ellipse can look like when there are 
just two components in the vector $\beta_2$, which we denote by $\beta_1$ and $\beta_2$, and the
parameter estimates are negatively correlated. The ellipse, which defines a
.95 confidence region, is centered at the parameter estimates $(\hat\beta_1, \hat\beta_2)$, with its
major axis oriented from upper left to lower right. Confidence intervals for $\beta_1$
and $\beta_2$ are also shown. The .95 confidence interval for $\beta_1$ is the line segment
AB, and the .95 confidence interval for $\beta_2$ is the line segment EF. We would
make quite different inferences if we considered AB and EF, and the rectangle
they define, demarcated in Figure 5.3 by the lines drawn with long dashes,
rather than the confidence ellipse. There are many points, such as $(\beta_1'', \beta_2'')$,
that lie outside the confidence ellipse but inside the two confidence intervals.
At the same time, there are some points, like $(\beta_1', \beta_2')$, that are contained in
the ellipse but lie outside one or both of the confidence intervals.


In the framework of the classical normal linear model, the estimates $\hat\beta_1$ and $\hat\beta_2$
are bivariate normal. The $t$ statistics used to test hypotheses about just one
of $\beta_1$ or $\beta_2$ are based on the marginal univariate normal distributions of $\hat\beta_1$
and $\hat\beta_2$, respectively, but the $F$ statistics used to test hypotheses about both
parameters at once are based on the joint bivariate normal distribution of the
two estimators. If $\hat\beta_1$ and $\hat\beta_2$ are not independent, as is the case in Figure 5.3,
then information about one of the parameters also provides information about
the other. Only the confidence region, based on the joint distribution, allows
this to be taken into account.


An example may be helpful at this point. Suppose that we are trying to model 
daily electricity demand during the summer months in an area where air con- 
ditioning is prevalent. Since the use of air conditioners, and hence electricity 
demand, is related to both temperature and humidity, we might want to use 
measures of both of them as explanatory variables. In many parts of the 
world, summer temperatures and humidity are strongly positively correlated. 
Therefore, if we include both variables in a regression, they may be approx- 
imately collinear. If so, as we saw in Section 3.4, the OLS estimates will be 
relatively imprecise. This lack of precision implies that confidence intervals for 
the coefficients of both temperature and humidity will be relatively long, and 
that confidence regions for both parameters jointly will be long and narrow. 
However, it does not necessarily imply that the area of a confidence region 
will be particularly large. This is precisely the situation that is illustrated in 
Figure 5.3. Think of $\beta_1$ as the coefficient of the temperature and $\beta_2$ as the
coefficient of the humidity.


In Exercise 5.9, readers are asked to show that, when there are two explana- 
tory variables in a linear regression model, the correlation between the OLS 
estimates of the parameters associated with these variables is the negative of 
the correlation between the variables themselves. Thus, in the example we 
have been discussing, a positive correlation between temperature and humid- 
ity leads to a negative correlation between the estimates of the temperature 
and humidity parameters, as shown in Figure 5.3. A point like $(\beta_1'', \beta_2'')$ is
excluded from the confidence region because the variation in electricity demand
cannot be accounted for if both coefficients are small. But $\beta_1''$ cannot be
excluded from the confidence interval for $\beta_1$ alone, because $\beta_1''$, which assigns
a small effect to the temperature, is perfectly compatible with the data if a
large effect is assigned to the humidity, that is, if $\beta_2$ is substantially greater
than $\beta_2''$. At the same time, even though $\beta_1'$ is outside the confidence interval
for $\beta_1$, the point $(\beta_1', \beta_2')$ is inside the confidence region, because the very high
value of $\beta_2'$ is enough to compensate for the very low value of $\beta_1'$.
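The claim that the correlation between the two estimates is the negative of the correlation between the two regressors can be checked numerically. The sketch below assumes a regression that also contains a constant term and uses the ordinary (centered) sample correlation; the variable names `temp` and `humid` and the simulated data are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
temp = rng.standard_normal(n)
# Humidity constructed to be positively correlated with temperature.
humid = 0.9 * temp + 0.4 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), temp, humid])
# Covariance matrix of the OLS estimates up to the scalar factor sigma^2,
# which cancels when a correlation is formed.
V = np.linalg.inv(X.T @ X)

corr_estimates = V[1, 2] / np.sqrt(V[1, 1] * V[2, 2])
corr_regressors = np.corrcoef(temp, humid)[0, 1]
```

In this setup the two correlations sum to zero up to rounding error, so strongly positively correlated regressors produce strongly negatively correlated estimates.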


The relation between a confidence region for two parameters and confidence 
intervals for each of the parameters individually is a subtle one. It is tempting 
to think that the ends of the intervals should be given by the extreme points 
of the confidence ellipse. This would imply, for example, that the confidence 
interval for $\beta_1$ in the figure is given by the line segment CD. Even without
the insight afforded by the temperature-humidity example, however, we can
see that this must be incorrect. The inequality (5.27) defines the confidence
region, for given parameter estimates $\hat\beta_1$ and $\hat\beta_2$, as a set of values in the
space of the vector $\beta_{20}$. If instead we think of (5.27) as defining a region in
the space of $\hat\beta_2$, with $\beta_{20}$ the true parameter vector, then we obtain a region
of exactly the same size and shape as the confidence region, because (5.27) is
symmetric in $\beta_{20}$ and $\hat\beta_2$. We can assign a probability of $1 - \alpha$ to the event
that $\hat\beta_2$ belongs to the new region, because the inequality (5.27) states that
the $F$ statistic is less than its $1 - \alpha$ quantile, an event of which the probability
is $1 - \alpha$, by definition.


An exactly similar argument can be made for the confidence interval for $\beta_1$.
In the two-dimensional framework of Figure 5.3, the entire infinitely high
rectangle bounded by the vertical lines through the points A and B has the
same size and shape as an area with probability $1 - \alpha$, since we are willing
to allow $\beta_2$ to take on any real value. Because the infinite rectangle and the
confidence ellipse must contain the same probability mass, neither can contain 
the other. Therefore, the ellipse must protrude outside the region defined by 
the one-dimensional confidence interval. 


It can be seen from (5.27) that the orientation of a confidence ellipse and 
the relative lengths of its axes are determined by $\widehat{\mathrm{Var}}(\hat\beta_2)$. When the two
parameter estimates are positively correlated, the ellipse will be oriented from 
lower left to upper right. When they are negatively correlated, it will be 
oriented from upper left to lower right, as in Figure 5.3. When the correlation 
is zero, the axes of the ellipse will be parallel to the coordinate axes. The 
variances of the two parameter estimates determine the height and width of 
the ellipse. If the variances are equal and the correlation is zero, the confidence 
ellipse will be a circle. 
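One way to see how the estimated covariance matrix fixes the shape of the ellipse is through its eigendecomposition. The following sketch is an illustration under stated assumptions, not the book's own procedure: it treats the region as the set of $d$ with $d^\top V^{-1} d \le r^2$, where $V$ plays the role of the estimated covariance matrix and `radius2` stands for the right-hand side (for (5.27), this would be $c_\alpha k_2$ once $s^2$ is absorbed into $V$). The function name and the example matrices are hypothetical.

```python
import numpy as np

def ellipse_axes(var_hat, radius2):
    """Semi-axis lengths and major-axis angle (degrees) of the set
    {d : d' var_hat^{-1} d <= radius2}, for a 2 x 2 covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(var_hat, dtype=float))
    # eigh returns eigenvalues in ascending order; the last gives the major axis.
    semi_minor, semi_major = np.sqrt(radius2 * eigvals)
    vx, vy = eigvecs[:, -1]
    angle = np.degrees(np.arctan2(vy, vx))
    # An axis has no direction, so fold the angle into (-90, 90].
    if angle > 90.0:
        angle -= 180.0
    elif angle <= -90.0:
        angle += 180.0
    return semi_major, semi_minor, angle

# Negatively correlated estimates with equal variances (illustrative values).
major, minor, angle = ellipse_axes([[1.0, -0.8], [-0.8, 1.0]], radius2=1.0)
# Uncorrelated estimates with equal variances: the region is a circle.
c_major, c_minor, _ = ellipse_axes([[2.0, 0.0], [0.0, 2.0]], radius2=1.0)
```

With a correlation of $-0.8$ and equal variances, the major axis lies at $-45$ degrees, that is, from upper left to lower right; with zero correlation and equal variances the two semi-axes coincide.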


Asymptotic and Bootstrap Confidence Regions 


When test statistics like (5.26), with known finite-sample distributions, are 
not available, the easiest way to construct an approximate confidence region 
is to base it on the statistic (5.18), which can be used with any k-vector of 
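parameter estimates $\hat\theta$ that is root-n consistent and asymptotically normal
and has a covariance matrix that can be consistently estimated by $\widehat{\mathrm{Var}}(\hat\theta)$. If
$c_\alpha$ denotes the $1 - \alpha$ quantile of the $\chi^2(k)$ distribution, then an approximate
$1 - \alpha$ confidence region is the set of all $\theta_0$ such that

$$ (\hat\theta - \theta_0)^\top \bigl(\widehat{\mathrm{Var}}(\hat\theta)\bigr)^{-1} (\hat\theta - \theta_0) \le c_\alpha. \tag{5.28} $$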


Like the exact confidence region defined by (5.27), this asymptotic confidence 
region will be elliptical or ellipsoidal. 


We can also use the statistic (5.18) to construct bootstrap confidence regions, 
making the same assumptions as were made above about $\hat\theta$ and $\widehat{\mathrm{Var}}(\hat\theta)$. As we
did for bootstrap confidence intervals, we use just one bootstrap DGP, either
parametric or semiparametric, characterized by the parameter vector $\hat\theta$. For
each of $B$ bootstrap samples, indexed by $j$, we obtain a vector of parameter
estimates $\theta_j^*$ and an estimated covariance matrix $\widehat{\mathrm{Var}}^*(\theta_j^*)$, in just the same
way as $\hat\theta$ and $\widehat{\mathrm{Var}}(\hat\theta)$ were obtained from the original data. For each $j$, we
compute the bootstrap “test statistic”


$$ \tau_j^* = (\theta_j^* - \hat\theta)^\top \bigl(\widehat{\mathrm{Var}}^*(\theta_j^*)\bigr)^{-1} (\theta_j^* - \hat\theta), \tag{5.29} $$

which is the multivariate analog of (5.15). We then find the bootstrap critical
value $c_\alpha^*$, which is the $1 - \alpha$ quantile of the EDF of the $\tau_j^*$. This is done by
sorting the $\tau_j^*$ from smallest to largest and then taking the entry numbered
$(B + 1)(1 - \alpha)$, assuming of course that $\alpha(B + 1)$ is an integer. For example,
if $B = 999$ and $\alpha = .05$, $c_\alpha^*$ will be the 950th entry in the sorted list. Then
the bootstrap confidence region is defined as the set of all $\theta_0$ such that


$$ (\hat\theta - \theta_0)^\top \bigl(\widehat{\mathrm{Var}}(\hat\theta)\bigr)^{-1} (\hat\theta - \theta_0) \le c_\alpha^*. \tag{5.30} $$


It is no accident that the bootstrap confidence region defined by (5.30) looks 
very much like the asymptotic confidence region defined by (5.28). The only 
difference is that the critical value $c_\alpha$, which appears on the right-hand side
of (5.28), comes from the asymptotic distribution of the test statistic, while
the critical value $c_\alpha^*$, which appears on the right-hand side of (5.30), comes
from the empirical distribution of the bootstrap samples. Both confidence
regions will have the same elliptical shape. When $c_\alpha^* > c_\alpha$, the region defined
by (5.30) will be larger than the region defined by (5.28), and the opposite
will be true when $c_\alpha^* < c_\alpha$.
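The rule for picking the bootstrap critical value, sorting the bootstrap statistics and taking entry number $(B + 1)(1 - \alpha)$, can be sketched in a few lines of Python. The function name `bootstrap_critical_value` is hypothetical, and the stand-in statistics below are simply the integers 1 to 999 in random order rather than genuine bootstrap output, which makes the answer easy to verify by eye.

```python
import random

def bootstrap_critical_value(taus, alpha):
    """Entry numbered (B + 1)(1 - alpha) of the sorted bootstrap statistics."""
    B = len(taus)
    position = (B + 1) * (1.0 - alpha)
    rank = round(position)
    # The rule presumes that alpha * (B + 1) is an integer.
    if abs(position - rank) > 1e-9:
        raise ValueError("alpha * (B + 1) must be an integer")
    return sorted(taus)[rank - 1]      # ranks are 1-based, lists are 0-based

# Stand-in statistics (assumed): the integers 1..999 in random order, so the
# 950th smallest value is simply 950.
random.seed(1)
taus = list(range(1, 1000))
random.shuffle(taus)
c_star = bootstrap_critical_value(taus, 0.05)
```

With $B = 999$ and $\alpha = .05$ the function returns the 950th entry of the sorted list, in agreement with the example in the text; a combination such as $B = 100$ and $\alpha = .05$ is rejected because $\alpha(B + 1)$ is not an integer.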


Although this procedure is similar to the studentized bootstrap procedure 
discussed in Section 5.3, its true analog is the procedure for obtaining a sym- 
metric bootstrap confidence interval that is the subject of Exercise 5.7. That 
procedure yields a symmetric interval because it is based on the square of 
the t statistic. Similarly, because this procedure is based on the quadratic 
form (5.18), the bootstrap confidence region defined by (5.30) is forced to 
have the same elliptical shape (but not the same size) as the asymptotic con- 
fidence region defined by (5.28). Of course, such a confidence region cannot 
be expected to work very well if the finite-sample distribution of $\hat\theta$ does not
in fact have contours that are approximately elliptical. 


In view of the many ways in which bootstrap confidence intervals can be 
constructed, it should come as no surprise to learn that there are also many 
other ways to construct bootstrap confidence regions. See Davison and Hink- 
ley (1997) for references and a discussion of some of these. 


5.5 Heteroskedasticity-Consistent Covariance Matrices 


All the testing procedures we have used in this chapter and the preceding 
one make use, implicitly if not explicitly, of standard errors or estimated 
covariance matrices. If we are to make reliable inferences about the values of 
parameters, these estimates should be reliable. In our discussion of how to 
estimate the covariance matrix of the OLS parameter vector $\hat\beta$ in Sections 3.4
and 3.6, we made the rather strong assumption that the error terms of the
regression model are IID. This assumption is needed to show that $s^2(X^\top X)^{-1}$,
the usual estimator of the covariance matrix of $\hat\beta$, is consistent in the sense
of (5.22). However, even without the IID assumption, it is possible to obtain
a consistent estimator of the covariance matrix of $\hat\beta$.


In this section, we treat the case in which the error terms are independent 
but not identically distributed. We focus on the linear regression model with 
exogenous regressors, 


$$ y = X\beta + u, \qquad E(u) = 0, \qquad E(uu^\top) = \Omega, \tag{5.31} $$


where $\Omega$, the error covariance matrix, is an $n \times n$ matrix with $t^{\rm th}$ diagonal
element equal to $\omega_t^2$ and all the off-diagonal elements equal to 0. Since $X$
is assumed to be exogenous, the expectations in (5.31) can be treated as
conditional on $X$. Conditional on $X$, then, the error terms in (5.31) are
uncorrelated and have mean 0, but they do not have the same variance for all
observations. These error terms are said to be heteroskedastic, or to exhibit
heteroskedasticity, a subject of which we spoke briefly in Section 1.3. If,
instead, all the error terms do have the same variance, then, as one might
expect, they are said to be homoskedastic, or to exhibit homoskedasticity.
Here we assume that the investigator knows nothing about the $\omega_t^2$. In other
words, the form of the heteroskedasticity is completely unknown.


The assumption in (5.31) that X is exogenous is fairly strong, but it is often 
reasonable for cross-section data, as we discussed in Section 3.2. We make 
it largely for simplicity, since we would obtain essentially the same asymp- 
totic results if we replaced it with the weaker assumption (3.10) that X is 
predetermined, that is, the assumption that $E(u_t \mid X_t) = 0$. When the data
are generated by a DGP that belongs to (5.31) with $\beta = \beta_0$, the exogeneity
assumption implies that $\hat\beta$ is unbiased; recall (3.09), which in no way depends
on assumptions about the covariance matrix of the error terms. 


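Whatever the form of the error covariance matrix $\Omega$, the covariance matrix
of the OLS estimator $\hat\beta$ is equal to


$$ \begin{aligned} E\bigl((\hat\beta - \beta_0)(\hat\beta - \beta_0)^\top\bigr) &= (X^\top X)^{-1}X^\top E(uu^\top)X(X^\top X)^{-1} \\ &= (X^\top X)^{-1}X^\top \Omega X(X^\top X)^{-1}. \end{aligned} \tag{5.32} $$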


This form of covariance matrix is often called a sandwich covariance matrix, 
for the obvious reason that the matrix $X^\top \Omega X$ is sandwiched between the
two instances of the matrix $(X^\top X)^{-1}$. The covariance matrix of an inefficient
estimator very often takes this sandwich form. We can see intuitively why the 
OLS estimator is inefficient when there is heteroskedasticity by noting that 
observations with low variance presumably convey more information about the 
parameters than observations with high variance, and so the former should 
be given greater weight in an efficient estimator. 
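When $\Omega$ is known and diagonal, the sandwich matrix (5.32) is easy to evaluate. The sketch below is illustrative only: the function name `sandwich`, the simulated regressor, and the two variance patterns are assumptions. It shows that the sandwich collapses to $\sigma^2(X^\top X)^{-1}$ under homoskedasticity, and it avoids ever forming the $n \times n$ matrix $\Omega$.

```python
import numpy as np

def sandwich(X, omega_diag):
    """Covariance matrix (5.32) of the OLS estimator for diagonal Omega."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # X' Omega X without the n x n matrix: scale row t of X by omega_t^2.
    filling = X.T @ (X * omega_diag[:, None])
    return XtX_inv @ filling @ XtX_inv

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

# Homoskedastic case (sigma^2 = 2): the sandwich collapses to sigma^2 (X'X)^{-1}.
V_homo = sandwich(X, np.full(n, 2.0))
# Heteroskedastic case (assumed form): variance rises with the squared regressor.
V_hetero = sandwich(X, 1.0 + X[:, 1] ** 2)
```

In the homoskedastic case the filling is $\sigma^2 X^\top X$, so one slice of bread cancels against it; in the heteroskedastic case no such cancellation occurs.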


If we knew the $\omega_t^2$, we could easily evaluate the sandwich covariance matrix
(5.32). In fact, as we will see in Chapter 7, we could do even better and
actually obtain efficient estimates of $\beta$. But it is assumed that we do not
know the $\omega_t^2$. Moreover, since there are $n$ of them, one for each observation,
we cannot hope to estimate the $\omega_t^2$ consistently without making additional
assumptions. Thus, at first glance, the situation appears hopeless. However,
even though we cannot evaluate (5.32), we can estimate it without having to
attempt the impossible task of estimating $\Omega$ consistently.


For the purposes of asymptotic theory, we wish to consider the covariance 
matrix, not of $\hat\beta$, but rather of $n^{1/2}(\hat\beta - \beta_0)$. This is just the limit of $n$ times
the matrix (5.32). By distributing factors of $n$ in such a way that we can take
limits of each of the factors in (5.32), we find that the asymptotic covariance
matrix of $n^{1/2}(\hat\beta - \beta_0)$ is


$$ \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr)^{-1} \lim_{n\to\infty} \Bigl(\tfrac{1}{n} X^\top \Omega X\Bigr) \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr)^{-1}. \tag{5.33} $$

Under assumption (4.49), the factor $\lim(n^{-1}X^\top X)^{-1}$, which appears twice in
(5.33) as the bread in the sandwich,¹ tends to a finite, deterministic, positive
definite matrix $(S_{X^\top X})^{-1}$. To estimate the limit, we can simply use the matrix
$(n^{-1}X^\top X)^{-1}$ itself. What is not so trivial is to estimate the middle factor,
$\lim(n^{-1}X^\top \Omega X)$, the filling in the sandwich. In a very famous paper, White
(1980) showed that, under certain conditions, including the existence of the
limit, this matrix can be estimated consistently by


$$ \frac{1}{n} X^\top \hat\Omega X, \tag{5.34} $$


where $\hat\Omega$ is an inconsistent estimator of $\Omega$. As we will see, there are several
admissible versions of $\hat\Omega$. The simplest version, and the one suggested in
White (1980), is a diagonal matrix with $t^{\rm th}$ diagonal element equal to $\hat u_t^2$, the
$t^{\rm th}$ squared OLS residual.


The $k \times k$ matrix $\lim(n^{-1}X^\top \Omega X)$, which is the middle factor of (5.33), is
symmetric. Therefore, it has only $\tfrac{1}{2}(k^2 + k)$ distinct elements. Since this number
is independent of the sample size, this matrix can be estimated consistently.
Its $ij^{\rm th}$ element is

$$ \lim_{n\to\infty} \Bigl(\frac{1}{n} \sum_{t=1}^{n} \omega_t^2 X_{ti} X_{tj}\Bigr). \tag{5.35} $$


This is to be estimated by the $ij^{\rm th}$ element of (5.34), which, for the simplest
version of $\hat\Omega$, is


$$ \frac{1}{n} \sum_{t=1}^{n} \hat u_t^2 X_{ti} X_{tj}. \tag{5.36} $$


¹ It is a moot point whether to call this limit an ordinary limit, as we do here, or
a probability limit, as we do in Section 4.5. The difference reflects the fact that, 
there, X is generated by some sort of DGP, usually stochastic, while here, we 
do everything conditional on X. We would, of course, need probability limits 
if X were merely predetermined rather than exogenous. 


Because $\hat\beta$ is consistent for $\beta_0$, $\hat u_t$ is consistent for $u_t$, and $\hat u_t^2$ is therefore
consistent for $u_t^2$. Thus, asymptotically, expression (5.36) is equal to


$$ \begin{aligned} \frac{1}{n} \sum_{t=1}^{n} u_t^2 X_{ti} X_{tj} &= \frac{1}{n} \sum_{t=1}^{n} (\omega_t^2 + v_t) X_{ti} X_{tj} \\ &= \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 X_{ti} X_{tj} + \frac{1}{n} \sum_{t=1}^{n} v_t X_{ti} X_{tj}, \end{aligned} \tag{5.37} $$


where $v_t$ is defined to equal $u_t^2$ minus its mean of $\omega_t^2$. Under suitable assumptions
about the $X_{ti}$ and the $\omega_t^2$, we can apply a law of large numbers to the
second term in the second line of (5.37); see White (1980, 1984) for details.
Since $v_t$ has mean 0 by construction, this term converges to 0, while the first
term converges to (5.35).


The above argument shows that (5.37) tends in probability to (5.35). Because 
(5.37) is asymptotically equivalent to (5.36), the latter also tends in proba- 
bility to (5.35). Consequently, we can use (5.34), the matrix with typical 
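element (5.36), to estimate $\lim(n^{-1}X^\top \Omega X)$ consistently, and the matrix


$$ (n^{-1}X^\top X)^{-1}\, n^{-1}X^\top \hat\Omega X\, (n^{-1}X^\top X)^{-1} \tag{5.38} $$


to estimate (5.33) consistently. Of course, in practice, we will ignore the
factors of $n^{-1}$ and use the matrix


$$ \widehat{\mathrm{Var}}_h(\hat\beta) = (X^\top X)^{-1} X^\top \hat\Omega X (X^\top X)^{-1} \tag{5.39} $$


directly to estimate the covariance matrix of $\hat\beta$.² It is not difficult to modify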
the arguments on asymptotic normality of the previous section so that they 
apply to the model (5.31). Therefore, we conclude that the OLS estimator is 
root-n consistent and asymptotically normal, with (5.39) being a consistent 
estimator of its covariance matrix. 


The sandwich estimator (5.39) that we have just derived is an example of 
a heteroskedasticity-consistent covariance matrix estimator, or HCCME for 
short. It was introduced to econometrics by White (1980), although there 
were some precursors in the statistics literature, notably Eicker (1963, 1967) 
and Hinkley (1977). By taking square roots of the diagonal elements of (5.39), 
we can obtain standard errors that are asymptotically valid in the presence 
of heteroskedasticity of unknown form. These heteroskedasticity-consistent 
standard errors, which may also be referred to as heteroskedasticity-robust, 
are often enormously useful. 
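A minimal sketch of how the estimator (5.39) might be computed follows. The function name `hc0`, the simulated data, and the particular form of heteroskedasticity are assumptions for illustration. Note that the $k \times k$ filling is built directly from the rows of $X$ weighted by the squared residuals, so the $n \times n$ matrix $\hat\Omega$ is never formed.

```python
import numpy as np

def hc0(X, y):
    """OLS estimates and the sandwich covariance matrix (5.39)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    u_hat = y - X @ beta_hat
    # X' Omega_hat X with Omega_hat = diag(u_hat_t^2), computed as a k x k sum.
    filling = (X * u_hat[:, None] ** 2).T @ X
    return beta_hat, XtX_inv @ filling @ XtX_inv

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
# Assumed DGP: the error standard deviation is proportional to x.
y = X @ np.array([1.0, 2.0]) + x * rng.standard_normal(n)

beta_hat, V_hc0 = hc0(X, y)
se_hc0 = np.sqrt(np.diag(V_hc0))   # heteroskedasticity-consistent standard errors
```

The square roots of the diagonal elements give the heteroskedasticity-consistent standard errors described above.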


² The HCCME (5.39) depends on $\hat\Omega$ only through $X^\top \hat\Omega X$, which is a symmetric
$k \times k$ matrix. Notice that we can compute the latter directly by calculating
$k(k + 1)/2$ quantities like (5.36) without the factor of $n^{-1}$.


Alternative Forms of HCCME


The original HCCME (5.39) that uses squared residuals to estimate the diagonals
of $\Omega$ is often called HC0. However, it is not the best possible covariance
matrix estimator, because, as we saw in Section 3.6, least squares residuals
tend to be too small. There are several better estimators that inflate the
squared residuals slightly so as to offset this tendency. Three straightforward
ways of estimating the $\omega_t^2$ are the following:


• Use $\hat u_t^2\,(n/(n - k))$, thus incorporating a degrees-of-freedom correction.
In practice, this means multiplying the entire matrix (5.39) by $n/(n - k)$.
The resulting HCCME is often called HC1.


• Use $\hat u_t^2/(1 - h_t)$, where $h_t \equiv X_t(X^\top X)^{-1}X_t^\top$ is the $t^{\rm th}$ diagonal element of
the “hat” matrix $P_X$ that projects orthogonally on to the space spanned
by the columns of $X$. Recall the result (3.44) that, when the variance
of all the $u_t$ is $\sigma^2$, the expectation of $\hat u_t^2$ is $\sigma^2(1 - h_t)$. Therefore, the
ratio of $\hat u_t^2$ to $1 - h_t$ would have expectation $\sigma^2$ if the error terms were
homoskedastic. The resulting HCCME is often called HC2.


• Use $\hat u_t^2/(1 - h_t)^2$. This is a slightly simplified version of what one gets
by employing a statistical technique called the jackknife. Dividing by
$(1 - h_t)^2$ may seem to be overcorrecting the residuals. However, when
the error terms are heteroskedastic, observations with large variances will
tend to influence the estimates a lot, and they will therefore tend to have
residuals that are very much too small. Thus, this estimator, which yields
an HCCME that is often called HC3, may be attractive if large variances
are associated with large values of $h_t$.
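The four variants differ only in how the squared residuals are rescaled before they enter the filling of the sandwich. The sketch below is one possible way to code them under that reading; the function name `hccme`, the simulated data, and the seed are illustrative assumptions.

```python
import numpy as np

def hccme(X, u_hat, kind="HC3"):
    """Sandwich estimator (5.39) with the residual adjustments listed above."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # h_t = X_t (X'X)^{-1} X_t', the diagonal of the hat matrix P_X.
    h = np.einsum("ti,ij,tj->t", X, XtX_inv, X)
    weights = {
        "HC0": u_hat ** 2,
        "HC1": u_hat ** 2 * n / (n - k),
        "HC2": u_hat ** 2 / (1.0 - h),
        "HC3": u_hat ** 2 / (1.0 - h) ** 2,
    }[kind]
    return XtX_inv @ (X.T @ (X * weights[:, None])) @ XtX_inv

rng = np.random.default_rng(11)
n = 30
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + np.abs(x) * rng.standard_normal(n)   # assumed DGP
u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

V = {name: hccme(X, u_hat, name) for name in ("HC0", "HC1", "HC2", "HC3")}
```

Because each $h_t$ lies between 0 and 1, the HC2 and HC3 weights are never smaller than the raw squared residuals, so the implied variances are ordered HC0, then HC2, then HC3; HC1 is simply HC0 scaled by $n/(n - k)$.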


The argument used in the preceding subsection for HC0 shows that all of
these procedures will give the correct answer asymptotically, but none of them
can be expected to do so in finite samples. In fact, inferences based on any
HCCME, especially HC0, may be seriously inaccurate even in samples of
moderate size.


It is not clear which of the more sophisticated procedures will work best in any 
particular case, although they can all be expected to work better than simply 
using the squared residuals without any adjustment. When some observations 
have much higher leverage than others, the methods that use the $h_t$ might be
expected to work better than simply using a degrees-of-freedom correction.
These methods were first discussed by MacKinnon and White (1985), who
found some evidence that the jackknife seemed to work best. Later simulations
by Long and Ervin (2000) also support the use of HC3. However, theoretical
work by Chesher (1989) and Chesher and Austin (1991) gave more ambiguous
results and suggested that HC2 might sometimes outperform HC3. It appears
that the best procedure to use depends on the $X$ matrix and on the form of
the heteroskedasticity.


When Does Heteroskedasticity Matter?


Even when the error terms are heteroskedastic, there are cases in which we 
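do not necessarily have to use an HCCME. Consider the $ij^{\rm th}$ element of
$n^{-1}X^\top \Omega X$, which is


$$ \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 X_{ti} X_{tj}. \tag{5.40} $$

If the limit as $n \to \infty$ of the average of the $\omega_t^2$, $t = 1, \ldots, n$, exists and is
denoted $\sigma^2$, then (5.40) can be written as


$$ \sigma^2\, \frac{1}{n} \sum_{t=1}^{n} X_{ti} X_{tj} + \frac{1}{n} \sum_{t=1}^{n} (\omega_t^2 - \sigma^2) X_{ti} X_{tj}. $$


The first term here is just the $ij^{\rm th}$ element of $\sigma^2 n^{-1}X^\top X$. Should it be the
case that


$$ \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} (\omega_t^2 - \sigma^2) X_{ti} X_{tj} = 0 \tag{5.41} $$

for $i, j = 1, \ldots, k$, then we find that


$$ \lim_{n\to\infty} \Bigl(\frac{1}{n} X^\top \Omega X\Bigr) = \sigma^2 \lim_{n\to\infty} \Bigl(\frac{1}{n} X^\top X\Bigr). \tag{5.42} $$


In this special case, we can replace the middle term of (5.33) by the right-hand
side of (5.42), and we find that the asymptotic covariance matrix of
$n^{1/2}(\hat\beta - \beta_0)$ is just

$$ \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr)^{-1} \sigma^2 \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr) \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr)^{-1} = \sigma^2 \Bigl(\lim_{n\to\infty} \tfrac{1}{n} X^\top X\Bigr)^{-1}. $$


The usual OLS estimate of the error variance is


$$ s^2 = \frac{1}{n - k} \sum_{t=1}^{n} \hat u_t^2, $$


and, if we assume that we can apply a law of large numbers, the probability
limit of this is


$$ \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 = \sigma^2, \tag{5.43} $$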


by definition. Thus we see that, in this special case, the usual OLS covariance 
matrix estimator (3.50) will be valid asymptotically. This important result 
was originally shown by White (1980). 


Equation (5.41) always holds when we are estimating only a sample mean. In 
that case, $X = \iota$, a vector with typical element $\iota_t = 1$, and


$$ \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 X_{ti} X_{tj} = \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 \iota_t^2 = \frac{1}{n} \sum_{t=1}^{n} \omega_t^2 \to \sigma^2 \quad \text{as } n \to \infty. $$

This shows that we do not have to worry about heteroskedasticity when
calculating the standard error of a sample mean. Of course, equation (5.41) also
holds when the error terms are homoskedastic. In that case, the $\sigma^2$ given
by (5.43) is just the variance of each of the error terms.
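For the sample mean, the agreement between the usual estimator and the HC0 form can be seen directly in a finite sample. The following sketch, with a hypothetical function name and made-up data, computes both variance estimates when the only regressor is a constant; in this case they differ only by the factor $(n - 1)/n$, whatever the data.

```python
def mean_variances(y):
    """Usual and HC0 variance estimates for a sample mean (X = iota)."""
    n = len(y)
    ybar = sum(y) / n
    ssr = sum((yt - ybar) ** 2 for yt in y)
    usual = ssr / (n - 1) / n        # s^2 (X'X)^{-1} with X'X = n and k = 1
    hc0 = ssr / n ** 2               # (X'X)^{-1} X' Omega_hat X (X'X)^{-1}
    return usual, hc0

# Illustrative data (assumed); any sample gives the same ratio.
usual, hc0 = mean_variances([1.0, 2.0, 3.0, 6.0])
ratio = hc0 / usual                  # equals (n - 1) / n
```

Since $(n - 1)/n$ tends to 1, the two estimators are asymptotically equivalent here, regardless of how the individual variances differ across observations.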


Although equation (5.41) holds only in certain special cases, it does make 
one thing clear. Any form of heteroskedasticity affects the efficiency of the 
ordinary least squares parameter estimator, but only heteroskedasticity that 
is related to the squares and cross-products of the $X_{ti}$ affects the validity of
the usual OLS covariance matrix estimator. 


HAC Covariance Matrix Estimators 


All HCCMEs depend on the assumption that $\Omega$ is diagonal. We are able to
compute them because we can consistently estimate the matrix $n^{-1}X^\top \Omega X$,
even though we cannot consistently estimate the matrix $\Omega$ itself. For essentially
the same reason, we can obtain valid covariance matrix estimators even
when $\Omega$ is not a diagonal matrix. However, in order for us to be able to
estimate $n^{-1}X^\top \Omega X$ consistently when $\Omega$ is unknown and is not diagonal, all
the off-diagonal elements which are not close to the principal diagonal must 
be sufficiently small. 


When the error terms of a regression model are correlated among themselves, 
then, as we mentioned in Section 1.3, they are said to display serial correla- 
tion or autocorrelation. Serial correlation is frequently encountered in models 
estimated using time series data. Often, observations that are close to each 
other are strongly correlated, but observations that are far apart are uncorrelated
or nearly so. In this situation, only the elements of $\Omega$ that are on
or close to the principal diagonal will be large. When this is the case, we 
may be able to obtain an estimate of the covariance matrix of the parameter 
estimates that is heteroskedasticity and autocorrelation consistent, or HAC. 
Computing a HAC covariance matrix estimator is essentially similar to com- 
puting an HCCME, but a good deal more complicated. HAC estimators will 
be discussed in Chapter 9. 


5.6 The Delta Method 


Econometricians often want to perform inference on nonlinear functions of 
model parameters. This requires them to estimate the standard error of a 
nonlinear function of parameter estimates or, more generally, the covariance 
matrix of a vector of such functions. One popular way to do so is called the 
delta method. It is based on an asymptotic approximation. 


For simplicity, let us start with the case of a single parameter. Suppose that we 
have estimated a scalar parameter $\theta$, which might be one of the coefficients of a
linear regression model, and that we are interested in the parameter $\gamma \equiv g(\theta)$,
where $g(\cdot)$ is a monotonic function that is continuously differentiable. In this
situation, the obvious way to estimate $\gamma$ is to use $\hat\gamma = g(\hat\theta)$. Since $\hat\theta$ is a random
variable, so is $\hat\gamma$. The problem is to estimate the variance of $\hat\gamma$.


Figure 5.4 Taylor’s Theorem


Since $\hat\gamma$ is a function of $\hat\theta$, it seems logical that $\mathrm{Var}(\hat\gamma)$ should be a function of
$\mathrm{Var}(\hat\theta)$. If $g(\theta)$ is a linear or affine function, then we already know how to
calculate $\mathrm{Var}(\hat\gamma)$; recall the result (3.33). The idea of the delta method is to find
a linear approximation to $g(\theta)$ and then apply (3.33) to this approximation.


Taylor’s Theorem 


It is frequently necessary in econometrics to obtain linear approximations 
to nonlinear functions. The mathematical tool most commonly used for this 
purpose is Taylor’s Theorem. In its simplest form, Taylor’s Theorem applies to 
functions of a scalar argument that are differentiable at least once on some real 
interval [a,b], with the derivative a continuous function on [a,b]. Figure 5.4 
shows the graph of such a function, $f(x)$, for $x \in [a,b]$.


The coordinates of A are (a, f(a)), and those of B are (b, f(b)). Thus the 
slope of the line AB is (f(b) — f(a))/(b— a). What drives the theorem is the 
observation that there must always be a value between a and b, like c in the 
figure, at which the derivative f’(c) is equal to the slope of AB. This is a 
consequence of the continuity of the derivative. If it were not continuous, and 
the graph of f(x) had a corner, the slope might always be greater than f’(c) 
on one side of the corner, and always be smaller on the other. But if f'(x) is 
continuous on [a,b], then there must exist c such that 


$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$


This can be rewritten as $f(b) = f(a) + (b - a) f'(c)$. If we let $h = b - a$,
then, since $c$ lies between $a$ and $b$, it must be the case that $c = a + th$,
for some $t$ between 0 and 1. Thus we obtain


$$f(a+h) = f(a) + h f'(a + th). \tag{5.44}$$


Equation (5.44) is the simplest expression of Taylor’s Theorem. 


Although (5.44) is an exact relationship, it involves the quantity $t$, which
is unknown. It is more usual just to set $t = 0$, so as to obtain a linear
approximation to the function $f(x)$ for $x$ in the neighborhood of $a$. This
approximation, called a first-order Taylor expansion around $a$, is


$$f(a+h) \cong f(a) + h f'(a),$$


where the symbol "$\cong$" means "is approximately equal to." The right-hand
side of this equation is an affine function of $h$.


Taylor's Theorem can be extended in order to provide approximations that
are quadratic or cubic functions, or polynomials of any desired order. The
exact statement of the theorem, with terms proportional to powers of $h$ up
to $h^p$, is


$$f(a+h) = f(a) + \sum_{i=1}^{p-1} \frac{h^i}{i!} f^{(i)}(a) + \frac{h^p}{p!} f^{(p)}(a + th).$$


Here $f^{(i)}$ is the $i$-th derivative of $f$, and once more $0 < t < 1$. The
approximate version of the theorem sets $t = 0$ and gives rise to a $p$-th order
Taylor expansion around $a$. A commonly encountered example of the latter is the
second-order Taylor expansion


$$f(a+h) \cong f(a) + h f'(a) + \tfrac{1}{2} h^2 f''(a).$$


Both versions of Taylor's Theorem require as a regularity condition that $f(x)$
should have a $p$-th derivative that is continuous on $[a, a + h]$.
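As a rough numerical illustration of these expansions, the following sketch compares first- and second-order Taylor approximations of the exponential function around a = 0. The choice of function, the point a, and the step h are assumptions made purely for illustration; the helper name taylor_exp is hypothetical.

```python
import math

def taylor_exp(a, h, order):
    """Taylor expansion of exp around a, evaluated at a + h, up to h**order.

    Every derivative of exp at a equals exp(a), so the i-th term is
    exp(a) * h**i / i!.
    """
    return sum(math.exp(a) * h**i / math.factorial(i) for i in range(order + 1))

a, h = 0.0, 0.1                      # illustrative values, not from the text
exact = math.exp(a + h)

err1 = abs(exact - taylor_exp(a, h, 1))   # first-order error, roughly h**2 / 2
err2 = abs(exact - taylor_exp(a, h, 2))   # second-order error, roughly h**3 / 6
print(err1, err2)
```

The second-order error should be much smaller than the first-order one, in line with the remainder terms being proportional to successively higher powers of h.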


There are also multivariate versions of Taylor's Theorem, and we will need
them from time to time. If $f(x)$ is now a scalar-valued function of the
$m$-vector $x$, then, for $p = 1$, Taylor's Theorem states that, if $h$ is also an
$m$-vector, then


$$f(x+h) = f(x) + \sum_{j=1}^{m} h_j f_j(x + th), \tag{5.45}$$


where $h_j$ is the $j$-th component of $h$, $f_j$ is the partial derivative of $f$
with respect to its $j$-th argument, and, as before, $0 < t < 1$.


The Delta Method for a Scalar Parameter


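If we assume that the estimator $\hat\theta$ is root-n consistent and
asymptotically normal, then


$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{\sim} N\bigl(0, V^\infty(\hat\theta)\bigr), \tag{5.46}$$


where $\theta_0$ denotes the true value of $\theta$. We will use
$V^\infty(\hat\theta)$ as a shorthand way of writing the asymptotic variance of
$n^{1/2}(\hat\theta - \theta_0)$.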


In order to find the asymptotic distribution of $\hat\gamma = g(\hat\theta)$, we
perform a first-order Taylor expansion of $g(\hat\theta)$ around $\theta_0$. We
obtain:


$$\hat\gamma \cong g(\theta_0) + g'(\theta_0)(\hat\theta - \theta_0), \tag{5.47}$$


where $g'(\theta_0)$ is the first derivative of $g(\theta)$, evaluated at
$\theta_0$. Given the root-n consistency of $\hat\theta$, (5.47) can be rearranged
into an asymptotic equality. Two deterministic quantities are said to be
asymptotically equal if they tend to the same limits as $n \to \infty$. Similarly,
two random quantities are said to be asymptotically equal if they tend to the
same limits in probability. As usual, we need a power of $n$ to make things work
correctly. Here, we multiply both sides of (5.47) by $n^{1/2}$. If we denote
$g(\theta_0)$, which is the true value of $\gamma$, by $\gamma_0$, then (5.47)
becomes


$$n^{1/2}(\hat\gamma - \gamma_0) \stackrel{a}{=} g_0'\, n^{1/2}(\hat\theta - \theta_0), \tag{5.48}$$


where the symbol $\stackrel{a}{=}$ is used for asymptotic equality, and
$g_0' \equiv g'(\theta_0)$. In Exercise 5.13, readers are asked to check that, if
we perform a second-order Taylor expansion, the last term of the expansion
vanishes asymptotically. This justifies (5.48) as an asymptotic equality.


Equation (5.48) shows that $n^{1/2}(\hat\gamma - \gamma_0)$ is asymptotically
normal with mean 0, since the right-hand side of (5.48) is just $g_0'$ times a
quantity that is asymptotically normal with mean 0; recall (5.46). The variance
of $n^{1/2}(\hat\gamma - \gamma_0)$ is clearly $(g_0')^2 V^\infty(\hat\theta)$, and
so we conclude that


$$n^{1/2}(\hat\gamma - \gamma_0) \stackrel{a}{\sim} N\bigl(0, (g_0')^2 V^\infty(\hat\theta)\bigr). \tag{5.49}$$


This shows that $\hat\gamma$ is root-n consistent and asymptotically normal when
$\hat\theta$ is.


The result (5.49) leads immediately to a practical procedure for estimating
the standard error of $\hat\gamma$. If the standard error of $\hat\theta$ is
$s_\theta$, then the standard error of $\hat\gamma$ will be


$$s_\gamma \equiv \bigl|g'(\hat\theta)\bigr|\, s_\theta. \tag{5.50}$$


This procedure can be based on any asymptotically valid estimator of the
standard deviation of $\hat\theta$. For example, if $\theta$ were one of the
coefficients of a linear regression model, then $s_\theta$ could be the square
root of the corresponding diagonal element of the usual estimated OLS covariance
matrix, or it could be the square root of the corresponding diagonal element of
an estimated heteroskedasticity-consistent covariance matrix.


In practice, the delta method is usually very easy to use. For example, consider
the case in which $\gamma = \theta^2$. Then $g'(\theta) = 2\theta$, and the formula
(5.50) tells us that $s_\gamma = 2|\hat\theta| s_\theta$. Notice that $s_\gamma$
depends on $\hat\theta$, something that is not true for either the usual OLS
standard error or the heteroskedasticity-consistent one discussed in the
preceding section.
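The calculation in (5.50) can be sketched in a few lines of code. The function name delta_method_se and the numerical values of the estimate and its standard error are hypothetical, chosen only to illustrate the case $\gamma = \theta^2$.

```python
def delta_method_se(g_prime, theta_hat, se_theta):
    """Delta-method standard error from (5.50): |g'(theta_hat)| * s_theta."""
    return abs(g_prime(theta_hat)) * se_theta

# Hypothetical estimate and standard error, for illustration only.
theta_hat, se_theta = 1.5, 0.2

gamma_hat = theta_hat ** 2                       # gamma = theta^2
se_gamma = delta_method_se(lambda th: 2.0 * th,  # g'(theta) = 2 * theta
                           theta_hat, se_theta)
print(gamma_hat, se_gamma)
```

Because the derivative is evaluated at the estimate, a different value of theta_hat with the same se_theta would give a different se_gamma.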


Confidence Intervals and the Delta Method 


Although the result (5.50) is simple and practical, it reveals some of the
limitations of asymptotic theory. Whenever the relationship between $\hat\theta$
and $\hat\gamma$ is nonlinear, it is impossible that both of them should be
normally distributed in finite samples. Suppose that $\hat\theta$ really did
happen to be normally distributed. Then, unless $g(\cdot)$ were linear,
$\hat\gamma$ could not possibly be normally, or even symmetrically, distributed.
Similarly, if $\hat\gamma$ were normally distributed, $\hat\theta$ could not
be. Moreover, as the example at the end of the last subsection showed, $s_\gamma$
will generally depend on $\hat\theta$. This implies that the numerator of a
t statistic for $\gamma$ will not be independent of the denominator. However,
independence was essential to the result, in Section 4.4, that the t statistic
actually follows the Student's t distribution.


The preceding arguments suggest that confidence intervals and test statis- 
tics based on asymptotic theory will often not be reliable in finite samples. 
Asymptotic normality of the parameter estimates is an essential underpinning 
of all asymptotic tests and confidence intervals or regions. When the finite- 
sample distributions of estimates are far from the limiting normal distribution, 
asymptotic procedures cannot be expected to perform well. 


Despite these caveats, we may still wish to construct an asymptotic confidence
interval for $\gamma$ based on (5.08). The result is


$$\bigl[\hat\gamma - s_\gamma z_{1-\alpha/2},\;\; \hat\gamma + s_\gamma z_{1-\alpha/2}\bigr], \tag{5.51}$$


where $s_\gamma$ is the delta method estimate (5.50), and $z_{1-\alpha/2}$ is the
$1 - \alpha/2$ quantile of the standard normal distribution. This confidence
interval can be expected to work well whenever the finite-sample distribution of
$\hat\gamma$ is well approximated by the normal distribution and $s_\gamma$ is a
reliable estimator of its standard deviation.


Using (5.08) is not the only way to obtain an asymptotic confidence interval
for $\gamma$, however. Another approach, which usually leads to an asymmetric
interval, is to transform the asymptotic confidence interval for the underlying
parameter $\theta$. The latter interval, which is similar to (5.08), is


$$\bigl[\hat\theta - s_\theta z_{1-\alpha/2},\;\; \hat\theta + s_\theta z_{1-\alpha/2}\bigr].$$


Transforming the endpoints of this interval by the function $g$ gives the
following interval for $\gamma$:


$$\bigl[g(\hat\theta - s_\theta z_{1-\alpha/2}),\;\; g(\hat\theta + s_\theta z_{1-\alpha/2})\bigr]. \tag{5.52}$$


This assumes that $g'(\theta) > 0$. If $g'(\theta) < 0$, the two ends of the
interval would have to be interchanged. Whenever $g(\theta)$ is a nonlinear
function, the confidence interval (5.52) will be asymmetric. It can be expected
to work well if the finite-sample distribution of $\hat\theta$ is well
approximated by the normal distribution and $s_\theta$ is a reliable estimator of
the standard deviation of $\hat\theta$.
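To make the contrast between (5.51) and (5.52) concrete, here is a minimal sketch that computes both intervals for $\gamma = \exp(\theta)$. The function names and the numerical inputs are hypothetical; the normal quantile comes from Python's standard library.

```python
import math
from statistics import NormalDist

def delta_interval(gamma_hat, se_gamma, alpha):
    """Symmetric interval (5.51): gamma_hat -/+ s_gamma * z_{1 - alpha/2}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return gamma_hat - se_gamma * z, gamma_hat + se_gamma * z

def transformed_interval(g, theta_hat, se_theta, alpha):
    """Interval (5.52): apply g to the endpoints of the interval for theta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = g(theta_hat - se_theta * z), g(theta_hat + se_theta * z)
    # If g is decreasing, the transformed endpoints must be interchanged.
    return (lo, hi) if lo <= hi else (hi, lo)

# Hypothetical inputs, for illustration only.
theta_hat, se_theta, alpha = 0.5, 0.1, 0.05

gamma_hat = math.exp(theta_hat)
se_gamma = abs(math.exp(theta_hat)) * se_theta   # (5.50), since g' = exp

sym = delta_interval(gamma_hat, se_gamma, alpha)
asym = transformed_interval(math.exp, theta_hat, se_theta, alpha)
print(sym, asym)
```

The first interval is centered on the estimate of gamma by construction, while the second extends further above the estimate than below it because exp is convex.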


The bootstrap confidence interval for $\theta$, (5.17), can also be transformed
by $g$ in order to obtain a bootstrap confidence interval for $\gamma$. The result
is


$$\bigl[g(\hat\theta - s_\theta c^*_{1-\alpha/2}),\;\; g(\hat\theta - s_\theta c^*_{\alpha/2})\bigr], \tag{5.53}$$


where $c^*_{\alpha/2}$ and $c^*_{1-\alpha/2}$ are, as in (5.17), the entries indexed
by $(\alpha/2)(B + 1)$ and $(1 - \alpha/2)(B + 1)$ in the sorted list of bootstrap
t statistics $t^*_j$.


Yet another way to construct a bootstrap confidence interval is to bootstrap
the t statistic for $\gamma$ directly. Using the original data, we compute
$\hat\theta$ and $s_\theta$, and then $\hat\gamma$ and $s_\gamma$ in terms of them.
The bootstrap DGP is the same as the one used to obtain a bootstrap confidence
interval for $\theta$, but this time, for each bootstrap sample $j$,
$j = 1, \ldots, B$, we compute $\hat\gamma^*_j$ and $(s_\gamma)^*_j$. The bootstrap
"t statistics" $(\hat\gamma^*_j - \hat\gamma)/(s_\gamma)^*_j$ are then sorted. If
$(c_\gamma)^*_{\alpha/2}$ and $(c_\gamma)^*_{1-\alpha/2}$ denote the entries indexed
by $(\alpha/2)(B + 1)$ and $(1 - \alpha/2)(B + 1)$ in the sorted list, then the
(asymmetric) bootstrap confidence interval is


$$\bigl[\hat\gamma - s_\gamma (c_\gamma)^*_{1-\alpha/2},\;\; \hat\gamma - s_\gamma (c_\gamma)^*_{\alpha/2}\bigr]. \tag{5.54}$$


As readers are asked to check in Exercise 5.16, the intervals (5.53) and (5.54) 
are not the same. 
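The two bootstrap intervals can be sketched as follows for the log of a population mean. Everything specific here is an assumption made for illustration: the simulated data, the use of simple resampling as the bootstrap DGP, the choice B = 999, and the helper name mean_and_se.

```python
import math
import random

def mean_and_se(sample):
    """Sample mean and its usual standard error."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((v - m) ** 2 for v in sample) / (n - 1)
    return m, math.sqrt(var / n)

rng = random.Random(12345)
y = [rng.lognormvariate(0.0, 0.5) for _ in range(100)]   # simulated positive data

B, alpha = 999, 0.05
theta_hat, s_theta = mean_and_se(y)
gamma_hat = math.log(theta_hat)          # gamma = log(theta)
s_gamma = s_theta / theta_hat            # (5.50) with g'(theta) = 1/theta

t_theta, t_gamma = [], []
for _ in range(B):
    y_star = rng.choices(y, k=len(y))    # resampling bootstrap DGP (an assumption)
    th, s_th = mean_and_se(y_star)
    t_theta.append((th - theta_hat) / s_th)
    t_gamma.append((math.log(th) - gamma_hat) / (s_th / th))
t_theta.sort()
t_gamma.sort()

# Entries (alpha/2)(B+1) and (1 - alpha/2)(B+1), converted to 0-based indices.
lo_idx = round((alpha / 2) * (B + 1)) - 1
hi_idx = round((1 - alpha / 2) * (B + 1)) - 1

# (5.53): transform the studentized bootstrap interval for theta.
ci_53 = (math.log(theta_hat - s_theta * t_theta[hi_idx]),
         math.log(theta_hat - s_theta * t_theta[lo_idx]))
# (5.54): bootstrap the t statistic for gamma directly.
ci_54 = (gamma_hat - s_gamma * t_gamma[hi_idx],
         gamma_hat - s_gamma * t_gamma[lo_idx])
print(ci_53, ci_54)
```

Both intervals reuse the same bootstrap samples; they differ only in which t statistic is sorted and in whether the transformation g is applied before or after the interval is formed.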


The Vector Case 

The result (5.49) can easily be extended to the case in which both $\theta$ and
$\gamma$ are vectors. Suppose that the former is a $k$-vector and the latter is an
$l$-vector, with $l \le k$. The relation between $\theta$ and $\gamma$ is
$\gamma \equiv g(\theta)$, where $g(\theta)$ is an $l$-vector of monotonic
functions that are continuously differentiable. The vector version of (5.46) is


$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{\sim} N\bigl(0, V^\infty(\hat\theta)\bigr), \tag{5.55}$$


where $V^\infty(\hat\theta)$ is the asymptotic covariance matrix of the vector
$n^{1/2}(\hat\theta - \theta_0)$. Using the result (5.55) and a first-order Taylor
expansion of $g(\hat\theta)$ around $\theta_0$, it can be shown that the vector
analog of (5.49) is


$$n^{1/2}(\hat\gamma - \gamma_0) \stackrel{a}{\sim} N\bigl(0, G_0 V^\infty(\hat\theta) G_0^\top\bigr), \tag{5.56}$$


where $G_0$ is an $l \times k$ matrix with typical element
$\partial g_i(\theta)/\partial\theta_j$, evaluated at $\theta_0$; see Exercise
5.14. The asymptotic covariance matrix that appears in (5.56) is an $l \times l$
matrix. It has full rank $l$ if $V^\infty(\hat\theta)$ is nonsingular and the
matrix of derivatives $G_0$ has full rank $l$.


In practice, the covariance matrix of $\hat\gamma$ may be estimated by the matrix


$$\widehat{\mathrm{Var}}(\hat\gamma) \equiv \hat G\, \widehat{\mathrm{Var}}(\hat\theta)\, \hat G^\top, \tag{5.57}$$


where $\widehat{\mathrm{Var}}(\hat\theta)$ is the estimated covariance matrix of
$\hat\theta$, and $\hat G \equiv G(\hat\theta)$. This result, which is similar to
(3.33), can be very useful. However, like all results based on asymptotic theory,
it should be used with caution. As in the scalar case discussed above,
$\hat\gamma$ cannot possibly be normally distributed if $\hat\theta$ is.
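A small sketch of (5.57) may help fix ideas. It uses plain Python lists rather than a matrix library; the two functions of the parameters, the parameter estimates, and their covariance matrix are all hypothetical values chosen for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def delta_cov(G, V):
    """Estimated covariance matrix G V G' from (5.57)."""
    return matmul(matmul(G, V), transpose(G))

# Hypothetical estimates (k = 2) and their estimated covariance matrix.
theta_hat = [2.0, 4.0]
V_theta = [[0.04, 0.01],
           [0.01, 0.09]]

# gamma_1 = theta_1 / theta_2 and gamma_2 = theta_1 * theta_2 (so l = 2).
t1, t2 = theta_hat
G_hat = [[1.0 / t2, -t1 / t2 ** 2],   # derivatives of gamma_1
         [t2, t1]]                    # derivatives of gamma_2

V_gamma = delta_cov(G_hat, V_theta)
se_gamma = [V_gamma[i][i] ** 0.5 for i in range(2)]
print(V_gamma, se_gamma)
```

Each row of G_hat holds the partial derivatives of one element of g, evaluated at the estimates, so the result is an l x l matrix whose diagonal gives the squared standard errors.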


Bootstrap Standard Errors 


The delta method is not the only way to obtain standard errors and covariance 
matrices for functions of parameter estimates. The bootstrap can also be used 
for this purpose. Indeed, much of the early work on the bootstrap, such as 
Efron (1979), was largely concerned with bootstrap standard errors. 


Suppose that, as in the previous subsection, we wish to calculate the covariance
matrix of the vector $\hat\gamma = g(\hat\theta)$. A bootstrap procedure for doing
this is:


1. Specify a bootstrap DGP, which may be parametric or semiparametric,
and use it to generate $B$ bootstrap samples, $y^*_j$.


2. For each bootstrap sample, use $y^*_j$ to compute the parameter vector
$\hat\theta^*_j$, and then use $\hat\theta^*_j$ to compute $\hat\gamma^*_j$.


3. Calculate $\bar\gamma^*$, the mean of the $\hat\gamma^*_j$. Then calculate the
estimated bootstrap covariance matrix,


$$\widehat{\mathrm{Var}}^*(\hat\gamma) = \frac{1}{B} \sum_{j=1}^{B} (\hat\gamma^*_j - \bar\gamma^*)(\hat\gamma^*_j - \bar\gamma^*)^\top.$$


If desired, bootstrap standard errors may be calculated as the square
roots of the diagonal elements of this matrix.
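The three steps above can be sketched for a scalar function of a scalar parameter. The setup is an assumption made for illustration: the data are simulated, the parameter is a population mean, the function is the logarithm, and the bootstrap DGP is a parametric one with normal errors.

```python
import math
import random

rng = random.Random(2024)
y = [rng.gauss(10.0, 2.0) for _ in range(200)]       # simulated data

n = len(y)
theta_hat = sum(y) / n
sd_hat = math.sqrt(sum((v - theta_hat) ** 2 for v in y) / (n - 1))
s_theta = sd_hat / math.sqrt(n)

B = 999
gamma_star = []
for _ in range(B):
    # Step 1: parametric bootstrap DGP with normal errors (an assumption).
    y_star = [rng.gauss(theta_hat, sd_hat) for _ in range(n)]
    # Step 2: estimate theta on the bootstrap sample, then transform it.
    theta_star = sum(y_star) / n
    gamma_star.append(math.log(theta_star))

# Step 3: bootstrap variance around the mean of the bootstrap estimates.
gamma_bar = sum(gamma_star) / B
var_star = sum((g - gamma_bar) ** 2 for g in gamma_star) / B
se_boot = math.sqrt(var_star)

se_delta = s_theta / theta_hat        # delta method (5.50), for comparison
print(se_boot, se_delta)
```

In a well-behaved case like this one, the bootstrap and delta-method standard errors should be close to each other.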


Bootstrap standard errors, which may or may not be more accurate than ones 
based on asymptotic theory, can certainly be useful as descriptive statistics. 
However, using them for inference generally cannot be recommended. In 
many cases, calculating bootstrap standard errors is almost as much work as 
calculating studentized bootstrap confidence intervals. As we noted at the 
end of Section 5.3, there are theoretical reasons to believe that the latter will 
yield more accurate inferences than confidence intervals based on asymptotic 
theory, including asymptotic confidence intervals that use bootstrap standard 
errors. Thus, if we are going to go to the trouble of calculating a large number 
of bootstrap estimates anyway, we can do better than just using them to 
compute bootstrap standard errors. 


5.7 Final Remarks


In this chapter, we have discussed a number of methods for constructing confi- 
dence intervals. They are all based on the idea of inverting a test statistic, and 
most of them are in no way restricted to OLS estimation. We first construct a 
family of test statistics for the null hypotheses that the parameter of interest 
is equal to a particular value, and then the limits of the confidence interval are 
obtained by solving the equation that sets the statistic equal to the critical 
values given by some appropriate distribution. The critical values may be 
quantiles of a finite-sample distribution, such as the Student’s t distribution, 
quantiles of an asymptotic distribution, such as the standard normal distribu- 
tion, or quantiles of a bootstrap EDF. Procedures for constructing confidence 
regions are very similar to those for constructing confidence intervals. 


All of the methods for constructing confidence intervals and regions that we 
have discussed require standard errors or, more generally, estimated covar- 
iance matrices. The chapter therefore includes a good deal of material on 
how to estimate these under weaker assumptions than were made in Chap- 
ter 3. Much of this material is widely applicable. Methods for estimation of 
covariance matrices in the presence of heteroskedasticity of unknown form, 
similar to those discussed in Section 5.5, are useful in the context of many 
different methods of estimation. The delta method, which was discussed in 
Section 5.6, is even more general, since it can be used whenever one parameter, 
or vector of parameters, is a nonlinear function of another. 


5.8 Exercises 


5.1 Find the .025, .05, .10, and .20 quantiles of the standard normal distribution.
Use these to obtain whatever quantiles of the $\chi^2(1)$ distribution you can.


5.2 Starting from the square of the t statistic (5.11), and using the $F(1, n - k)$
distribution, obtain a .99 confidence interval for the parameter $\beta_2$ in the
classical normal linear model (4.21).


5.3 The file earnings.data contains sorted data on four variables for 4266
individuals. One of the variables is income, $y$, and the other three are dummy
variables, $d_1$, $d_2$, and $d_3$, which correspond to different age ranges.
Regress $y$ on all three dummy variables. Then use the regression output to
construct a .95 asymptotic confidence interval for the mean income of individuals
that belong to age group 3.


5.4 Using the same data as Exercise 5.3, regress y on a constant for individuals 
in age group 3 only. Use the regression output to construct a .95 asymptotic 
confidence interval for the mean income of group 3 individuals. Explain why 
this confidence interval is not the same as the one you constructed previously. 


5.5 Generate 999 realizations of a random variable that follows the $\chi^2(2)$
distribution, and find the .95 and .99 "quantiles" of the EDF, that is, the 950th
and 990th entries in the sorted list of the realizations. Compare these with
the .95 and .99 quantiles of the $\chi^2(2)$ distribution.


5.6 Using the data in the file earnings.data, construct a .95 studentized bootstrap
confidence interval for the mean income of group 3 individuals. Explain why
this confidence interval differs from the one you constructed in Exercise 5.4.


5.7 Explain in detail how to construct a symmetric bootstrap confidence interval
based on the possibly asymptotic t statistic $(\hat\theta - \theta_0)/s_\theta$.
Express your answer in terms of entries in a sorted list of bootstrap t statistics.


5.8 Show that the F statistic for the null hypothesis that $\beta_2 = \beta_{20}$
in the model (5.24), or, equivalently, for the null hypothesis that $\gamma_2 = 0$
in (5.25), can be written as (5.26). Interpret the numerator of expression (5.26)
as a random variable constructed from the multivariate normal vector
$\hat\beta_2$.


5.9 Consider a regression model with just two centered explanatory variables,
$x_1$ and $x_2$:

$$y = \beta_1 x_1 + \beta_2 x_2 + u. \tag{5.58}$$

Let $\hat\rho$ denote the sample correlation of $x_1$ and $x_2$. By the sample
correlation, we mean

$$\hat\rho \equiv \frac{\sum_{t=1}^{n} x_{t1} x_{t2}}{\bigl(\sum_{t=1}^{n} x_{t1}^2 \sum_{t=1}^{n} x_{t2}^2\bigr)^{1/2}},$$

where $x_{t1}$ and $x_{t2}$ are typical elements of $x_1$ and $x_2$, respectively.
This can be interpreted as the correlation of the joint EDF of $x_1$ and $x_2$.

Show that, under the assumptions of the classical normal linear model, the
correlation between the OLS estimates $\hat\beta_1$ and $\hat\beta_2$ is equal to
$-\hat\rho$. Which, if any, of the assumptions of this model can be relaxed without
changing this result?


5.10 Consider the .05 level confidence region for the parameters $\beta_1$ and
$\beta_2$ of the regression model (5.58). In the two-dimensional space
$\mathcal{S}(x_1, x_2)$ generated by the two regressors, consider the set of points
of the form $\beta_{10} x_1 + \beta_{20} x_2$, where $(\beta_{10}, \beta_{20})$
belongs to the confidence region. Show that this set is a circular disk with
center at the OLS estimates $(x_1\hat\beta_1 + x_2\hat\beta_2)$. What is the radius
of the disk?


5.11 Using the data in the file earnings.data, regress $y$ on all three dummy
variables, and compute a heteroskedasticity-consistent standard error for the
coefficient of $d_3$. Using these results, construct a .95 asymptotic confidence
interval for the mean income of individuals that belong to age group 3. Compare
this interval with the ones you constructed in Exercises 5.3, 5.4, and 5.6.


5.12 Generate $N$ simulated data sets, where $N$ is between 1000 and 1,000,000,
depending on the capacity of your computer, from each of the following two
data generating processes:

DGP 1: $y_t = \beta_1 + \beta_2 X_{t2} + \beta_3 X_{t3} + u_t, \quad u_t \sim N(0, 1)$

DGP 2: $y_t = \beta_1 + \beta_2 X_{t2} + \beta_3 X_{t3} + u_t, \quad u_t \sim N(0, \sigma_t^2), \quad \sigma_t^2 = (E(y_t))^2.$

There are 50 observations, $\beta = [1 \;\vdots\; 1 \;\vdots\; 1]$, and the data on
the exogenous variables are to be found in the file mw.data. These data were
originally used by MacKinnon and White (1985).

For each of the two DGPs and each of the $N$ simulated data sets, construct
.95 confidence intervals for $\beta_1$ and $\beta_2$ using the usual OLS covariance
matrix and the HCCMEs $HC_0$, $HC_1$, $HC_2$, and $HC_3$. The OLS interval should
be based on the Student's t distribution with 47 degrees of freedom, and the
others should be based on the $N(0, 1)$ distribution. Report the proportion of
the time that each of these confidence intervals included the true values of
the parameters.

On the basis of these results, which covariance matrix estimator would you
recommend using in practice?


5.13 Write down a second-order Taylor expansion of the nonlinear function
$g(\hat\theta)$ around $\theta_0$, where $\hat\theta$ is an OLS estimator and
$\theta_0$ is the true value of the parameter $\theta$. Explain why the last term
is asymptotically negligible relative to the second term.


5.14 Using a multivariate first-order Taylor expansion, show that, if
$\gamma = g(\theta)$, the asymptotic covariance matrix of the $l$-vector
$n^{1/2}(\hat\gamma - \gamma_0)$ is given by the $l \times l$ matrix
$G_0 V^\infty(\hat\theta) G_0^\top$. Here $\theta$ is a $k$-vector with $k \ge l$,
$G_0$ is an $l \times k$ matrix with typical element
$\partial g_i(\theta)/\partial\theta_j$, evaluated at $\theta_0$, and
$V^\infty(\hat\theta)$ is the $k \times k$ asymptotic covariance matrix of
$n^{1/2}(\hat\theta - \theta_0)$.


5.15 Suppose that $\gamma = \exp(\beta)$ and $\hat\beta = 1.324$, with a standard
error of 0.2432. Calculate $\hat\gamma = \exp(\hat\beta)$ and its standard error.

Construct two different .99 confidence intervals for $\gamma$. One should be based
on (5.51), and the other should be based on (5.52).


5.16 Construct two .95 bootstrap confidence intervals for the log of the mean
income (not the mean of the log of income) of group 3 individuals from the
data in earnings.data. These intervals should be based on (5.53) and (5.54).
Verify that these two intervals are different.


5.17 Use the DGP

$$y_t = 0.8\, y_{t-1} + u_t, \quad u_t \sim \mathrm{NID}(0, 1)$$

to generate a sample of 30 observations. Using these simulated data, obtain
estimates of $\rho$ and $\sigma^2$ for the model

$$y_t = \rho y_{t-1} + u_t, \quad E(u_t) = 0, \quad E(u_t u_s) = \sigma^2 \delta_{ts},$$

where $\delta_{ts}$ is the Kronecker delta introduced in Section 1.4. By use of the
parametric bootstrap with the assumption of normal errors, obtain two .95
confidence intervals for $\rho$, one symmetric, the other asymmetric.


Chapter 6


Nonlinear Regression


6.1 Introduction 


Up to this point, we have discussed only linear regression models. For each
observation $t$ of any regression model, there is an information set $\Omega_t$
and a suitably chosen vector $X_t$ of explanatory variables that belong to
$\Omega_t$. A linear regression model consists of all DGPs for which the
expectation of the dependent variable $y_t$ conditional on $\Omega_t$ can be
expressed as a linear combination $X_t\beta$ of the components of $X_t$, and for
which the error terms satisfy suitable requirements, such as being IID. Since, as
we saw in Section 1.3, the elements of $X_t$ may be nonlinear functions of the
variables originally used to define $\Omega_t$, many types of nonlinearity can be
handled within the framework of the linear regression model. However, many other
types of nonlinearity cannot be handled within this framework. In order to deal
with them, we often need to estimate nonlinear regression models. These are
models for which $E(y_t \mid \Omega_t)$ is a nonlinear function of the parameters.


A typical nonlinear regression model can be written as


$$y_t = x_t(\beta) + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2), \quad t = 1, \ldots, n, \tag{6.01}$$


where, just as for the linear regression model, $y_t$ is the $t$-th observation on
the dependent variable, and $\beta$ is a $k$-vector of parameters to be estimated.
The scalar function $x_t(\beta)$ is a nonlinear regression function. It determines
the mean value of $y_t$ conditional on $\Omega_t$, which is made up of some set of
explanatory variables. These explanatory variables, which may include lagged
values of $y_t$ as well as exogenous variables, are not shown explicitly in (6.01).
However, the $t$ subscript of $x_t(\beta)$ indicates that the regression function
varies from observation to observation. This variation usually occurs because
$x_t(\beta)$ depends on explanatory variables, but it can also occur because the
functional form of the regression function actually changes over time. The number
of explanatory variables, all of which must belong to $\Omega_t$, need not be
equal to $k$.


The error terms in (6.01) are specified to be IID. By this, we mean something
very similar to, but not precisely the same as, the two conditions in (4.48). In
order for the error terms to be identically distributed, the distribution of each
error term $u_t$, conditional on the corresponding information set $\Omega_t$,
must be the same for all $t$. In order for them to be independent, the
distribution of $u_t$, conditional not only on $\Omega_t$ but also on all the
other error terms, should be the same as its distribution conditional on
$\Omega_t$ alone, without any dependence on the other error terms.


Another way to write the nonlinear regression model (6.01) is


$$y = x(\beta) + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I), \tag{6.02}$$


where $y$ and $u$ are $n$-vectors with typical elements $y_t$ and $u_t$,
respectively, and $x(\beta)$ is an $n$-vector of which the $t$-th element is
$x_t(\beta)$. Thus $x(\beta)$ is the nonlinear analog of the vector $X\beta$ in the
linear case.


As a very simple example of a nonlinear regression model, consider the model


$$y_t = \beta_1 + \beta_2 Z_{t1} + \frac{1}{\beta_2} Z_{t2} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2), \tag{6.03}$$


where $Z_{t1}$ and $Z_{t2}$ are explanatory variables. For this model,


$$x_t(\beta) = \beta_1 + \beta_2 Z_{t1} + \frac{1}{\beta_2} Z_{t2}.$$


Although the regression function $x_t(\beta)$ is linear in the explanatory
variables, it is nonlinear in the parameters, because the coefficient of $Z_{t2}$
is constrained to equal the inverse of the coefficient of $Z_{t1}$. In practice,
many nonlinear regression models, like (6.03), can be expressed as linear
regression models in which the parameters must satisfy one or more nonlinear
restrictions.


The Linear Regression Model with AR(1) Errors 


We now consider a particularly important example of a nonlinear regression 
model that is also a linear regression model subject to nonlinear restrictions 
on the parameters. In Section 5.5, we briefly mentioned the phenomenon of 
serial correlation, in which nearby error terms in a regression model are (or 
appear to be) correlated. Serial correlation is very commonly encountered in 
applied work using time-series data, and many techniques for dealing with it 
have been proposed. One of the simplest and most popular ways of dealing 
with serial correlation is to assume that the error terms follow the first-order 
autoregressive, or AR(1), process 


$$u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2), \quad |\rho| < 1. \tag{6.04}$$


According to this model, the error at time $t$ is equal to $\rho$ times the error
at time $t - 1$, plus a new error term $\varepsilon_t$. The vector $\varepsilon$
with typical component $\varepsilon_t$ satisfies the IID condition we discussed
above. This condition is enough for $\varepsilon_t$ to be an innovation in the
sense of Section 4.5. Thus the $\varepsilon_t$ are homoskedastic and independent
of all past and future innovations. We see from (6.04) that, in each period, part
of the error term $u_t$ is the previous period's error term, shrunk somewhat
toward zero and possibly changed in sign, and part is the innovation
$\varepsilon_t$. We will discuss serial correlation, including the AR(1) process
and other autoregressive processes, in Chapter 7. At present, we are concerned
solely with the nonlinear regression model that results when the errors of a
linear regression model are assumed to follow an AR(1) process.


If we combine (6.04) with the linear regression model


$$y_t = X_t\beta + u_t \tag{6.05}$$


by substituting $\rho u_{t-1} + \varepsilon_t$ for $u_t$ and then replacing
$u_{t-1}$ by $y_{t-1} - X_{t-1}\beta$, we obtain the nonlinear regression model


$$y_t = \rho y_{t-1} + X_t\beta - \rho X_{t-1}\beta + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2). \tag{6.06}$$


Since the lagged dependent variable $y_{t-1}$ appears among the regressors, this
is a dynamic model. As with the other dynamic models that are treated
in the exercises, we have to drop the first observation, because $y_0$ and $X_0$
are assumed not to be available. The model is linear in the regressors but
nonlinear in the parameters $\beta$ and $\rho$, and it therefore needs to be
estimated by nonlinear least squares or some other nonlinear estimation method.
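The substitution that produces (6.06) can be checked numerically. The sketch below simulates a linear regression with AR(1) errors and confirms that, at the true parameter values, the residuals from the regression function in (6.06) reproduce the innovations. The sample size, parameter values, regressor design, starting value of the error process, and the function name ar1_regression_function are all assumptions made for illustration.

```python
import random

def ar1_regression_function(y, X, beta, rho):
    """x_t(beta, rho) = rho*y_{t-1} + X_t beta - rho*X_{t-1} beta, t = 2..n.

    The first observation is dropped because y_0 and X_0 are unavailable.
    """
    xb = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return [rho * y[t - 1] + xb[t] - rho * xb[t - 1] for t in range(1, len(y))]

rng = random.Random(7)
n, beta, rho = 50, [1.0, 0.5], 0.6          # hypothetical true values

X = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(n)]   # constant plus one regressor
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]        # innovations

u = [eps[0]]                                # arbitrary starting value for u
for t in range(1, n):
    u.append(rho * u[t - 1] + eps[t])       # AR(1) process (6.04)

y = [sum(x * b for x, b in zip(X[t], beta)) + u[t] for t in range(n)]  # (6.05)

fitted = ar1_regression_function(y, X, beta, rho)
resid = [y[t] - fitted[t - 1] for t in range(1, n)]
max_gap = max(abs(r - e) for r, e in zip(resid, eps[1:]))
print(len(resid), max_gap)
```

The gap should be zero up to rounding error, and only n - 1 residuals are available, which reflects the loss of the first observation.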


In the next section, we study estimators for nonlinear regression models gen- 
erated by the method of moments, and we establish conditions for asymptotic 
identification, asymptotic normality, and asymptotic efficiency. Then, in Sec- 
tion 6.3, we show that, under the assumption that the error terms are IID, the 
most efficient MM estimator is nonlinear least squares, or NLS. In Section 6.4, 
we discuss various methods by which NLS estimates may be computed. The 
method of choice in most circumstances is some variant of Newton’s Method. 
One commonly-used variant is based on an artificial linear regression called 
the Gauss-Newton regression. We introduce this artificial regression in Sec- 
tion 6.5 and show how to use it to compute NLS estimates and estimates of 
their covariance matrix. In Section 6.6, we introduce the important concept 
of one-step estimation. Then, in Section 6.7, we show how to use the Gauss- 
Newton regression to compute hypothesis tests. Finally, in Section 6.8, we 
introduce a modified Gauss-Newton regression suitable for use in the pres- 
ence of heteroskedasticity of unknown form. 


6.2 Method of Moments Estimators for Nonlinear Models 


In Section 1.5, we derived the OLS estimator for linear models from the
method of moments by using the fact that, for each observation, the mean
of the error term in the regression model is zero conditional on the vector of
explanatory variables. This implied that


$$E(X_t^\top u_t) = E\bigl(X_t^\top (y_t - X_t\beta)\bigr) = 0. \tag{6.07}$$


The sample analog of the middle expression here is $n^{-1}X^\top(y - X\beta)$.
Setting this to zero and ignoring the factor of $n^{-1}$, we obtained the vector
of moment conditions


$$X^\top(y - X\beta) = 0, \tag{6.08}$$


and these conditions were easily solved to yield the OLS estimator $\hat\beta$.
We now want to employ the same type of argument for nonlinear models.


An information set $\Omega_t$ is typically characterized by a set of variables
that belong to it. But, since the realization of any deterministic function of
these variables is known as soon as the variables themselves are realized,
$\Omega_t$ must contain not only the variables that characterize it but also all
deterministic functions of them. As a result, an information set $\Omega_t$
contains precisely those variables which are equal to their expectations
conditional on $\Omega_t$. In Exercise 6.1, readers are asked to show that the
conditional expectation of a random variable is also its expectation conditional
on the set of all deterministic functions of the conditioning variables.


For the nonlinear regression model (6.01), the error term $u_t$ has mean 0
conditional on all variables in $\Omega_t$. Thus, if $W_t$ denotes any
$1 \times k$ vector of which all the components belong to $\Omega_t$,


$$E(W_t^\top u_t) = E\bigl(W_t^\top (y_t - x_t(\beta))\bigr) = 0. \tag{6.09}$$


Just as the moment conditions that correspond to (6.07) are (6.08), the moment
conditions that correspond to (6.09) are


$$W^\top(y - x(\beta)) = 0, \tag{6.10}$$


where $W$ is an $n \times k$ matrix with typical row $W_t$. There are $k$ nonlinear
equations in (6.10). These equations can, in principle, be solved to yield an
estimator of the $k$-vector $\beta$. Geometrically, the moment conditions (6.10)
require that the vector of residuals should be orthogonal to all the columns
of the matrix $W$.
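As an illustration of the moment conditions (6.10), the following sketch evaluates $n^{-1}W^\top(y - x(\beta))$ for the model (6.03) at the true parameter vector and at a wrong one. The simulated data, the true parameter values, the particular choice of W (a constant and the first explanatory variable), and the function names are hypothetical.

```python
import random

def x_t(beta, z1, z2):
    """Regression function of (6.03): beta1 + beta2*Z_t1 + Z_t2/beta2."""
    b1, b2 = beta
    return b1 + b2 * z1 + z2 / b2

def moments(beta, y, Z, W):
    """The k-vector (1/n) W'(y - x(beta)) appearing in the moment conditions."""
    n = len(y)
    resid = [y[t] - x_t(beta, *Z[t]) for t in range(n)]
    k = len(W[0])
    return [sum(W[t][j] * resid[t] for t in range(n)) / n for j in range(k)]

rng = random.Random(1)
n, beta0 = 2000, (1.0, 2.0)                 # hypothetical true parameters

Z = [(rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)) for _ in range(n)]
y = [x_t(beta0, *Z[t]) + rng.gauss(0.0, 1.0) for t in range(n)]

# One possible choice of W: a constant and Z_t1, both in the information set.
W = [(1.0, Z[t][0]) for t in range(n)]

m_true = moments(beta0, y, Z, W)            # should be close to zero
m_wrong = moments((2.0, 2.0), y, Z, W)      # intercept off by one
print(m_true, m_wrong)
```

At the true parameters the sample moments are small because they average mean-zero terms, whereas a wrong parameter vector generally leaves them well away from zero; an MM estimator is the value of beta that sets them exactly to zero.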


How should we choose $W$? There are infinitely many possibilities. Almost
any matrix $W$, of which the $t$-th row depends only on variables that belong
to $\Omega_t$, and which has full column rank $k$ asymptotically, will yield a
consistent estimator of $\beta$. However, these estimators will in general have
different asymptotic covariance matrices, and it is therefore of interest to see
if any particular choice of $W$ leads to an estimator with smaller asymptotic
variance than the others. Such a choice would then lead to an efficient estimator,
judged by the criterion of the asymptotic variance.


Identification and Asymptotic Identification 


Let us denote by $\hat\beta$ the MM estimator defined implicitly by (6.10). In
order to show that $\hat\beta$ is consistent, we must assume that the parameter
vector $\beta$ in the model (6.01) is asymptotically identified. In general, a
vector of parameters is said to be identified by a given data set and a given
estimation method if, for that data set, the estimation method provides a unique
way to determine the parameter estimates. In the present case, $\beta$ is
identified by a given data set if equations (6.10) have a unique solution.


For the parameters of a model to be asymptotically identified by a given
estimation method, we require that the estimation method provide a unique
way to determine the parameter estimates in the limit as the sample size $n$
tends to infinity. In the present case, asymptotic identification can be
formulated in terms of the probability limit of the vector
$n^{-1}W^\top(y - x(\beta))$ as $n \to \infty$. Suppose that the true DGP is a
special case of the model (6.02) with parameter vector $\beta_0$. Then we have


$$\frac{1}{n} W^\top\bigl(y - x(\beta_0)\bigr) = \frac{1}{n} W^\top u = \frac{1}{n} \sum_{t=1}^{n} W_t^\top u_t. \tag{6.11}$$


By (6.09), every term in the sum above has mean 0, and the IID assumption
in (6.02) is enough to allow us to apply a law of large numbers to that sum. It
follows that the right-hand side, and therefore also the left-hand side, of (6.11)
tends to zero in probability as $n \to \infty$.


Let us now define the $k$-vector of deterministic functions $\alpha(\beta)$ as follows:


$$\alpha(\beta) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top\bigl(y - x(\beta)\bigr), \tag{6.12}$$


where we continue to assume that $y$ is generated by (6.02) with $\beta_0$. The law
of large numbers can be applied to the right-hand side of (6.12) whatever the
value of $\beta$, thus showing that the components of $\alpha$ are deterministic. In the
preceding paragraph, we explained why $\alpha(\beta_0) = 0$. The parameter vector $\beta$
will be asymptotically identified if $\beta_0$ is the unique solution to the equations
$\alpha(\beta) = 0$, that is, if $\alpha(\beta) \ne 0$ for all $\beta \ne \beta_0$.


Although most parameter vectors that are identified by data sets of reasonable 
size are also asymptotically identified, neither of these concepts implies the 
other. It is possible for an estimator to be asymptotically identified without 
being identified by many data sets, and it is possible for an estimator to 
be identified by every data set of finite size without being asymptotically 
identified. To see this, consider the following two examples. 


As an example of the first possibility, suppose that $y_t = \beta_1 + \beta_2 z_t + u_t$, where $z_t$
is a random variable which follows the Bernoulli distribution. Such a random
variable is often called a binary variable, because there are only two possible
values it can take on, 0 and 1. The probability that $z_t = 1$ is $p$, and so
the probability that $z_t = 0$ is $1 - p$. If $p$ is small, there could easily be
samples of size $n$ for which every $z_t$ was equal to 0. For such samples, the
parameter $\beta_2$ cannot be identified, because changing $\beta_2$ can have no effect
on $y_t - \beta_1 - \beta_2 z_t$. However, provided that $p > 0$, both parameters will be
identified asymptotically. As $n \to \infty$, a law of large numbers guarantees that
the proportion of the $z_t$ that are equal to 1 will tend to $p$.
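

To get a feel for how easily such samples can occur, the following minimal sketch computes the probability that every $z_t$ in a sample equals 0. It assumes, beyond what is stated above, that the $z_t$ are independent across observations; the function name `prob_all_zero` and the value $p = 0.01$ are purely illustrative.

```python
def prob_all_zero(p: float, n: int) -> float:
    """Probability that n independent Bernoulli(p) draws are all equal to 0."""
    return (1.0 - p) ** n


# With small p, samples in which beta_2 is not identified are common for
# moderate n, but their probability shrinks towards 0 as n grows.
for n in (10, 100, 1000):
    print(n, prob_all_zero(0.01, n))
```

The probability $(1-p)^n$ is positive for every finite $n$ when $p < 1$, yet it tends to 0 whenever $p > 0$, which is the sense in which identification can fail in finite samples but hold asymptotically.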


As an example of the second possibility, consider the model (3.20), discussed
in Section 3.3, for which $y_t = \beta_1 + \beta_2\,(1/t) + u_t$, where $t$ is a time trend. The
OLS estimators of $\beta_1$ and $\beta_2$ can, of course, be computed for any finite sample
of size at least 2, and so the parameters are identified by any data set with
at least 2 observations. But $\beta_2$ is not identified asymptotically. Suppose that
the true parameter values are $\beta_1^0$ and $\beta_2^0$. Let us use the two regressors for the
variables in the information set $\Omega_t$, so that $W_t = [1 \;\; 1/t]$ and the MM estimator
is the same as the OLS estimator. Then, using the definition (6.12), we obtain


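$$\alpha(\beta_1, \beta_2) = \operatorname*{plim}_{n\to\infty}
\begin{bmatrix}
n^{-1}\sum_{t=1}^{n} \bigl((\beta_1^0 - \beta_1) + \tfrac{1}{t}(\beta_2^0 - \beta_2) + u_t\bigr) \\[4pt]
n^{-1}\sum_{t=1}^{n} \bigl(\tfrac{1}{t}(\beta_1^0 - \beta_1) + \tfrac{1}{t^2}(\beta_2^0 - \beta_2) + \tfrac{1}{t}u_t\bigr)
\end{bmatrix}. \tag{6.13}$$


It is known that the deterministic sums $n^{-1}\sum_{t=1}^{n}(1/t)$ and $n^{-1}\sum_{t=1}^{n}(1/t^2)$
both tend to 0 as $n \to \infty$. Further, the law of large numbers tells us that the
limits in probability of $n^{-1}\sum_{t=1}^{n} u_t$ and $n^{-1}\sum_{t=1}^{n}(u_t/t)$ are both 0. Thus the
right-hand side of (6.13) simplifies to


$$\alpha(\beta_1, \beta_2) = \begin{bmatrix} \beta_1^0 - \beta_1 \\ 0 \end{bmatrix}.$$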

Since $\alpha(\beta_1, \beta_2)$ vanishes for $\beta_1 = \beta_1^0$ and for any value of $\beta_2$ whatsoever, we
see that $\beta_2$ is not asymptotically identified. In Section 3.3, we showed that,
although the OLS estimator of $\beta_2$ is unbiased, it is not consistent. The
simultaneous failure of consistency and asymptotic identification in this example is
not a coincidence: It will turn out that asymptotic identification is a necessary
and sufficient condition for consistency.
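

The failure of consistency in this example can be seen numerically. The sketch below, which is only an illustration and assumes IID errors, computes the lower-right element of $(X^\top X)^{-1}$ for the regressors $1$ and $1/t$; multiplied by the error variance, this is the variance of the OLS estimator of $\beta_2$. The function name `var_factor_beta2` is hypothetical.

```python
import numpy as np


def var_factor_beta2(n: int) -> float:
    """Return the (2,2) element of (X'X)^{-1} for X = [1, 1/t], t = 1, ..., n.

    Multiplying by the error variance gives Var(beta2_hat) under IID errors.
    """
    t = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), 1.0 / t])
    return float(np.linalg.inv(X.T @ X)[1, 1])


# The factor does not shrink towards 0 as n grows.
for n in (10, 1_000, 100_000):
    print(n, var_factor_beta2(n))
```

Because $\sum_t 1/t^2$ converges to $\pi^2/6$ while $n^{-1}\bigl(\sum_t 1/t\bigr)^2$ tends to 0, the factor settles near $6/\pi^2$ rather than vanishing, so additional observations eventually carry almost no information about $\beta_2$.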


Consistency 


Suppose that the DGP is a special case of the model (6.02) with true parameter
vector $\beta_0$. Under the assumption of asymptotic identification, the equations
$\alpha(\beta) = 0$ have a unique solution, namely, $\beta = \beta_0$. This can be shown to imply
that, as $n \to \infty$, the probability limit of the estimator $\hat\beta$ defined by (6.10) is
precisely $\beta_0$. We will not attempt a formal proof of this result, since it would
have to deal with a number of technical issues that are beyond the scope of
this book. See Amemiya (1985, Section 4.3) or Davidson and MacKinnon
(1993, Section 5.3) for more detailed treatments.


However, an intuitive, heuristic proof is not at all hard to provide. If we
make the assumption that $\hat\beta$ has a deterministic probability limit, say $\beta_\infty$,
the result follows easily. What makes a formal proof more difficult is showing
that $\beta_\infty$ exists. Let us suppose that $\beta_\infty \ne \beta_0$. We will derive a contradiction
from this assumption, and we will thus be able to conclude that $\beta_\infty = \beta_0$, in
other words, that $\hat\beta$ is consistent.


For all finite samples large enough for $\beta$ to be identified by the data, we have,
by the definition (6.10) of $\hat\beta$, that


$$\frac{1}{n} W^\top\bigl(y - x(\hat\beta)\bigr) = 0. \tag{6.14}$$


If we take the limit of this as $n \to \infty$, we have 0 on the right-hand side. On
the left-hand side, because we assume that $\operatorname{plim} \hat\beta = \beta_\infty$, the limit is the same
as the limit of


$$\frac{1}{n} W^\top\bigl(y - x(\beta_\infty)\bigr).$$


By (6.12), the limit of this expression is $\alpha(\beta_\infty)$. We assumed that $\beta_\infty \ne \beta_0$,
and so, by the asymptotic identification condition, $\alpha(\beta_\infty) \ne 0$. But this
contradicts the fact that the limits of both sides of (6.14) are equal, since the
limit of the right-hand side is 0.


We have shown that, if we assume that a deterministic $\beta_\infty$ exists, then
asymptotic identification is sufficient for consistency. Although we will not attempt
to prove it, asymptotic identification is also necessary for consistency. The
key to a proof is showing that, if the parameters of a model are not
asymptotically identified by a given estimation method, then no deterministic limit
like $\beta_\infty$ exists in general. An example of this is provided by the model (3.20);
see also Exercise 6.2.


The identifiability of a parameter vector, whether asymptotic or by a data set,
depends on the estimation method used. In the present context, this means
that certain choices of the variables in $W_t$ may identify the parameters of a
model like (6.01), while others do not. We can gain some intuition about this
matter by looking a little more closely at the limiting functions $\alpha(\beta)$ defined
by (6.12). We have


$$\begin{aligned}
\alpha(\beta) &= \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top\bigl(y - x(\beta)\bigr) \\
&= \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top\bigl(x(\beta_0) - x(\beta) + u\bigr) \\
&= \alpha(\beta_0) + \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top\bigl(x(\beta_0) - x(\beta)\bigr) \\
&= \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top\bigl(x(\beta_0) - x(\beta)\bigr).
\end{aligned} \tag{6.15}$$


Therefore, for asymptotic identification, and so also for consistency, the last
expression in (6.15) must be nonzero for all $\beta \ne \beta_0$.


Evidently, a necessary condition for asymptotic identification is that there be
no $\beta_1 \ne \beta_0$ such that $x(\beta_1) = x(\beta_0)$. This condition is the nonlinear analog of
the requirement of linearly independent regressors for linear regression models.
We can now see that this requirement is in fact a condition necessary for the
identification of the model parameters, both by a data set and asymptotically.
Suppose that, for a linear regression model, the columns of the regressor
matrix $X$ are linearly dependent. This implies that there is a nonzero vector $b$
such that $Xb = 0$; recall the discussion in Section 2.2. Then it follows that
$X\beta_0 = X(\beta_0 + b)$. For a linear regression model, $x(\beta) = X\beta$. Therefore,
if we set $\beta_1 = \beta_0 + b$, the linear dependence means that $x(\beta_1) = x(\beta_0)$, in
violation of the necessary condition stated at the beginning of this paragraph.


For a linear regression model, linear independence of the regressors is both
necessary and sufficient for identification by any data set. We saw above that
it is necessary, and sufficiency follows from the fact, discussed in Section 2.2,
that $X^\top X$ is nonsingular if the columns of $X$ are linearly independent. If
$X^\top X$ is nonsingular, the OLS estimator $(X^\top X)^{-1}X^\top y$ exists and is unique
for any $y$, and this is precisely what is meant by identification by any data set.
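

The link between linear dependence and the failure of identification is easy to check numerically. In the following sketch, the simulated regressors, the vector `b`, and the parameter values are all hypothetical; the third column of `X` is constructed to be an exact linear combination of the first two.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Third column is an exact linear combination of the first two.
X = np.column_stack([x1, x2, x1 + 2.0 * x2])
b = np.array([1.0, 2.0, -1.0])        # nonzero vector with X b = 0

beta0 = np.array([0.5, -1.0, 3.0])
beta1 = beta0 + b                      # a different parameter vector

rank = np.linalg.matrix_rank(X)        # 2, not 3: X'X is singular
same_fit = np.allclose(X @ beta0, X @ beta1)
print(rank, same_fit)
```

Since `X @ beta0` and `X @ beta1` coincide, no data set generated with these regressors can distinguish $\beta_0$ from $\beta_0 + b$.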


For nonlinear models, however, things are more complicated. In general, more
is needed for identification than the condition that no $\beta_1 \ne \beta_0$ exist such that
$x(\beta_1) = x(\beta_0)$. The relevant issues will be easier to understand after we have
derived the asymptotic covariance matrix of the estimator defined by (6.10),
and so we postpone study of them until later.


The MM estimator $\hat\beta$ defined by (6.10) is actually consistent under
considerably weaker assumptions about the error terms than those we have made. The
key to the consistency proof is the requirement that the error terms satisfy
the condition


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top u = 0. \tag{6.16}$$


Under reasonable assumptions, it is not difficult to show that this condition
holds even when the $u_t$ are heteroskedastic, and it may also hold even when
they are serially correlated. However, difficulties can arise when the $u_t$ are
serially correlated and $x_t(\beta)$ depends on lagged dependent variables. In this
case, it will be seen later that the expectation of $u_t$ conditional on the lagged
dependent variable is nonzero in general. Therefore, in this circumstance,
condition (6.16) will not hold whenever $W$ includes lagged dependent variables,
and such MM estimators will generally not be consistent.


Asymptotic Normality 


The MM estimator $\hat\beta$ defined by (6.10) for different possible choices of $W$
is asymptotically normal under appropriate conditions. As we discussed in
Section 5.4, this means that the vector $n^{1/2}(\hat\beta - \beta_0)$ follows, asymptotically,
the multivariate normal distribution with mean vector 0 and a covariance matrix
that will be determined shortly.


Before we start our analysis, we need some notation, which will be used
extensively in the remainder of this chapter. In formulating the generic nonlinear
regression model (6.01), we deliberately used $x_t(\cdot)$ to denote the regression
function, rather than $f_t(\cdot)$ or some other notation, because this notation makes
it easy to see the close connection between the nonlinear and linear regression
models. It is natural to let the derivative of $x_t(\beta)$ with respect to $\beta_i$ be
denoted $X_{ti}(\beta)$. Then we can let $X_t(\beta)$ denote a $1 \times k$ vector, and $X(\beta)$ denote
an $n \times k$ matrix, each having typical element $X_{ti}(\beta)$. These are the analogs of
the vector $X_t$ and the matrix $X$ for the linear regression model. In the linear
case, when the regression function is $X\beta$, it is easy to see that $X_t(\beta) = X_t$
and $X(\beta) = X$. The big difference between the linear and nonlinear cases is
that, in the latter case, $X_t(\beta)$ and $X(\beta)$ depend on $\beta$.
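

As a concrete illustration of this notation, the sketch below builds the matrix $X(\beta)$ for a hypothetical regression function $x_t(\beta) = \beta_1 \exp(\beta_2 z_t)$, which is not a model taken from the text, and checks the analytic derivatives against central differences. The names `x_fun`, `X_analytic`, and `X_numeric` are illustrative only.

```python
import numpy as np


def x_fun(beta, z):
    """Hypothetical regression function x_t(beta) = beta_1 * exp(beta_2 * z_t)."""
    return beta[0] * np.exp(beta[1] * z)


def X_analytic(beta, z):
    """n x k matrix X(beta): column i holds the derivative of x_t w.r.t. beta_i."""
    e = np.exp(beta[1] * z)
    return np.column_stack([e, beta[0] * z * e])


def X_numeric(beta, z, h=1e-6):
    """Central-difference approximation to the same matrix."""
    beta = np.asarray(beta, dtype=float)
    cols = []
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = h
        cols.append((x_fun(beta + step, z) - x_fun(beta - step, z)) / (2 * h))
    return np.column_stack(cols)


z = np.linspace(0.0, 1.0, 5)
beta = np.array([2.0, 0.5])
print(np.max(np.abs(X_analytic(beta, z) - X_numeric(beta, z))))
```

Both columns of `X_analytic` change when `beta` changes, which is exactly the dependence on $\beta$ that is absent in the linear case.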


If we multiply (6.10) by $n^{-1/2}$, replace $y$ by what it is equal to under the
DGP (6.01) with parameter vector $\beta_0$, and replace $\beta$ by $\hat\beta$, we obtain


$$n^{-1/2} W^\top\bigl(u + x(\beta_0) - x(\hat\beta)\bigr) = 0. \tag{6.17}$$


The next step is to apply Taylor’s Theorem to the components of the
vector $x(\hat\beta)$; see the discussion of this theorem in Section 5.6. We apply the
formula (5.45), replacing $x$ by the true parameter vector $\beta_0$ and $h$ by the
vector $\hat\beta - \beta_0$, and obtain, for $t = 1, \ldots, n$,


$$x_t(\hat\beta) = x_t(\beta_0) + \sum_{i=1}^{k} X_{ti}(\bar\beta_t)\,(\hat\beta_i - \beta_{0i}), \tag{6.18}$$


where $\beta_{0i}$ is the $i^{\rm th}$ element of $\beta_0$, and $\bar\beta_t$, which plays the role of $x + th$
in (5.45), satisfies the condition


$$\|\bar\beta_t - \beta_0\| \le \|\hat\beta - \beta_0\|. \tag{6.19}$$


Substituting the Taylor expansion (6.18) into (6.17) yields


$$n^{-1/2} W^\top u - n^{-1/2} W^\top X(\bar\beta)(\hat\beta - \beta_0) = 0. \tag{6.20}$$


The notation $X(\bar\beta)$ is convenient, but slightly inaccurate. According to (6.18),
we need different parameter vectors $\bar\beta_t$ for each row of that matrix. But, since
all of these vectors satisfy (6.19), it is not necessary to make this fact explicit
in the notation. Thus here, and in subsequent chapters, we will refer to a
vector $\bar\beta$ that satisfies (6.19), without implying that it must be the same
vector for every row of the matrix $X(\bar\beta)$. This is a legitimate notational
convenience, because, since $\hat\beta$ is consistent, as we have seen that it is under
the requirement of asymptotic identification, then so too are all of the $\bar\beta_t$.
Consequently, (6.20) remains true asymptotically if we replace $\bar\beta$ by $\beta_0$. Doing
this, and rearranging factors of powers of $n$ so as to work only with quantities
which have suitable probability limits, yields the result that


$$n^{-1/2} W^\top u - n^{-1} W^\top X(\beta_0)\, n^{1/2}(\hat\beta - \beta_0) \stackrel{a}{=} 0. \tag{6.21}$$


This result is the starting point for all our subsequent analysis.


We need to apply a law of large numbers to the first factor of the second term
of (6.21), namely, $n^{-1}W^\top X_0$, where for notational ease we write $X_0 \equiv X(\beta_0)$.
Under reasonable regularity conditions, not unlike those needed for (3.17) to
hold, we have


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top X_0 = \lim_{n\to\infty} \frac{1}{n} W^\top E\bigl(X(\beta_0)\bigr) \equiv S_{W^\top X},$$


where $S_{W^\top X}$ is a deterministic $k \times k$ matrix. It turns out that a sufficient
condition for the parameter vector $\beta$ to be asymptotically identified by the
estimator $\hat\beta$ defined by the moment conditions (6.10) is that $S_{W^\top X}$ should
have full rank. To see this, observe that (6.21) implies that


$$S_{W^\top X}\, n^{1/2}(\hat\beta - \beta_0) \stackrel{a}{=} n^{-1/2} W^\top u. \tag{6.22}$$


Because $S_{W^\top X}$ is assumed to have full rank, its inverse exists. Thus we can
multiply both sides of (6.22) by this inverse to obtain a well-defined expression
for the limit of $n^{1/2}(\hat\beta - \beta_0)$:


$$n^{1/2}(\hat\beta - \beta_0) \stackrel{a}{=} (S_{W^\top X})^{-1}\, n^{-1/2} W^\top u. \tag{6.23}$$


From this, we conclude that $\beta$ is asymptotically identified by $\hat\beta$. The condition
that $S_{W^\top X}$ be nonsingular is called strong asymptotic identification. It is a
sufficient but not necessary condition for ordinary asymptotic identification.


The second factor on the right-hand side of (6.23) is a vector to which we
should, under appropriate regularity conditions, be able to apply a central
limit theorem. Since, by (6.09), $E(W_t u_t) = 0$, we can show that $n^{-1/2}W^\top u$
is asymptotically multivariate normal, with mean vector 0 and a finite
covariance matrix. To do this, we can use exactly the same reasoning as was used in
Section 4.5 to show that the vector $v$ of (4.53) is asymptotically multivariate
normal. Because the components of $n^{1/2}(\hat\beta - \beta_0)$ are, asymptotically, linear
combinations of the components of a vector that follows the multivariate
normal distribution, we conclude that $n^{1/2}(\hat\beta - \beta_0)$ itself must be asymptotically
normally distributed with mean vector zero and a finite covariance matrix.
This implies that $\hat\beta$ is root-$n$ consistent in the sense defined in Section 5.4.


Asymptotic Efficiency 


The asymptotic covariance matrix of $n^{-1/2}W^\top u$, the second factor on the
right-hand side of (6.23), is, by arguments exactly like those in (4.54),


$$\sigma_0^2 \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top W = \sigma_0^2\, S_{W^\top W}, \tag{6.24}$$


where $\sigma_0^2$ is the error variance for the true DGP, and where we make the
definition $S_{W^\top W} \equiv \operatorname{plim}\, n^{-1}W^\top W$. From (6.23) and (6.24), it follows immediately
that the asymptotic covariance matrix of the vector $n^{1/2}(\hat\beta - \beta_0)$ is


$$\sigma_0^2\, (S_{W^\top X})^{-1} S_{W^\top W} (S_{W^\top X}^\top)^{-1}, \tag{6.25}$$


which has the form of a sandwich. By the definitions of $S_{W^\top W}$ and $S_{W^\top X}$,
expression (6.25) can be rewritten as


$$\begin{aligned}
&\sigma_0^2 \operatorname*{plim}_{n\to\infty}\, (n^{-1}W^\top X_0)^{-1}\, n^{-1}W^\top W\, (n^{-1}X_0^\top W)^{-1} \\
&\quad= \sigma_0^2 \operatorname*{plim}_{n\to\infty}\, \bigl(n^{-1}X_0^\top W (W^\top W)^{-1} W^\top X_0\bigr)^{-1} \\
&\quad= \sigma_0^2 \operatorname*{plim}_{n\to\infty}\, \bigl(n^{-1}X_0^\top P_W X_0\bigr)^{-1},
\end{aligned} \tag{6.26}$$


where $P_W$ is the orthogonal projection on to $\mathcal{S}(W)$, the subspace spanned by
the columns of $W$. Expression (6.26) is the asymptotic covariance matrix of
the vector $n^{1/2}(\hat\beta - \beta_0)$. However, it is common to refer to it as the
asymptotic covariance matrix of $\hat\beta$, and we will allow ourselves this slight abuse of
terminology when no confusion can result.
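

The algebra that takes the sandwich form (6.25) to the compact form (6.26) can be verified on simulated matrices. The sketch below is a finite-sample check only: it drops the factors of $n$ and $\sigma_0^2$, which cancel or are common to both sides, and the matrices `X0` and `W` are hypothetical random draws rather than anything from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X0 = rng.normal(size=(n, k))
W = X0 + 0.5 * rng.normal(size=(n, k))    # hypothetical W, correlated with X0

# Sandwich form: (W'X0)^{-1} (W'W) (X0'W)^{-1}
WtX = W.T @ X0
sandwich = np.linalg.inv(WtX) @ (W.T @ W) @ np.linalg.inv(WtX.T)

# Compact form: (X0' P_W X0)^{-1}, with P_W the projection on to S(W)
P_W = W @ np.linalg.solve(W.T @ W, W.T)
compact = np.linalg.inv(X0.T @ P_W @ X0)

print(np.allclose(sandwich, compact))
```

The identity relies on $W^\top X_0$ being square and nonsingular, which is the finite-sample counterpart of the full-rank condition on $S_{W^\top X}$.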


It is clear from the result (6.26) that the asymptotic covariance matrix of
the estimator $\hat\beta$ depends on the variables $W$ used to obtain it. Most choices
of $W$ will lead to an inefficient estimator by the criterion of the asymptotic
covariance matrix, as we would be led to suspect by the fact that (6.25) has the
form of a sandwich; see Section 5.5. An efficient estimator by that criterion is
given by the choice $W = X_0$. To demonstrate this, we need to show that this
choice of $W$ minimizes the asymptotic covariance matrix, in the sense used in
the Gauss-Markov theorem. Recall that one covariance matrix is said to be
“greater” than another if the difference between it and the other is a positive
semidefinite matrix.


If we set $W = X_0$ to define the MM estimator, the asymptotic covariance
matrix (6.26) becomes $\sigma_0^2 \operatorname{plim}\,(n^{-1}X_0^\top X_0)^{-1}$. As we saw in Section 3.5, it
is often easier to establish efficiency by reasoning in terms of the precision
matrix, that is, the inverse of the covariance matrix, rather than in terms of
the covariance matrix itself. Since


$$X_0^\top X_0 - X_0^\top P_W X_0 = X_0^\top M_W X_0,$$


which is a positive semidefinite matrix, it follows at once that the precision
of the estimator obtained by setting $W = X_0$ is greater than that of the
estimator obtained by using any other choice of $W$.
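

A quick numerical check of this positive semidefiniteness, under the assumption of arbitrary simulated `X0` and `W` (both hypothetical), might look as follows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X0 = rng.normal(size=(n, k))
W = rng.normal(size=(n, k))                # an arbitrary alternative choice of W

P_W = W @ np.linalg.solve(W.T @ W, W.T)    # projection on to S(W)
M_W = np.eye(n) - P_W                      # projection on to its orthogonal complement

# Difference between the two precision matrices, up to the common scale factor.
diff = X0.T @ X0 - X0.T @ P_W @ X0
eigs = np.linalg.eigvalsh((diff + diff.T) / 2)   # symmetrize against rounding
print(eigs)
```

All eigenvalues are nonnegative up to rounding error, because $X_0^\top M_W X_0 = (M_W X_0)^\top (M_W X_0)$ is a matrix of inner products.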


Of course, we cannot actually use $X_0$ for $W$ in practice, because $X_0 \equiv X(\beta_0)$
depends on the unknown true parameter vector $\beta_0$. The MM estimator that
uses $X_0$ for $W$ is therefore said to be infeasible. In the next section, we will
see how to overcome this difficulty. The nonlinear least squares estimator that
we will obtain will turn out to have exactly the same asymptotic properties
as the infeasible MM estimator.


There are at least two ways in which we can approximate the asymptotically
efficient, but infeasible, MM estimator that uses $X_0$ for $W$. The first, and
perhaps the simpler of the two, is to begin by choosing any $W$ for which $W_t$
belongs to the information set $\Omega_t$, and using this $W$ to obtain a preliminary
consistent estimate, say $\acute\beta$, of the model parameters. We can then estimate $\beta$
once more, setting $W = \acute X \equiv X(\acute\beta)$. The consistency of $\acute\beta$ ensures that $\acute X$
tends to the efficient choice $X_0$ as $n \to \infty$.


A more subtle approach is to recognize that the above procedure estimates the
same parameter vector twice, and to compress the two estimation procedures
into one. Consider the moment conditions


$$X^\top(\beta)\bigl(y - x(\beta)\bigr) = 0. \tag{6.27}$$


If the estimator $\hat\beta$ obtained by solving the $k$ equations (6.27) is consistent,
then $\hat X \equiv X(\hat\beta)$ tends to $X_0$ as $n \to \infty$. Therefore, it must be the case
that, for sufficiently large samples, $\hat\beta$ is very close to the infeasible, efficient
MM estimator.


The estimator $\hat\beta$ based on (6.27) is known as the nonlinear least squares, or
NLS, estimator. The name comes from the fact that the moment conditions
(6.27) are just the first-order conditions for the minimization with respect
to $\beta$ of the sum-of-squared-residuals (or SSR) function. The SSR function is
defined just as in (1.49), but for a nonlinear regression function:


$$\mathrm{SSR}(\beta) = \sum_{t=1}^{n} \bigl(y_t - x_t(\beta)\bigr)^2 = \bigl(y - x(\beta)\bigr)^\top\bigl(y - x(\beta)\bigr). \tag{6.28}$$


It is easy to check (see Exercise 6.4) that the moment conditions (6.27) are
equivalent to the first-order conditions for minimizing (6.28).
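

The equivalence can also be checked numerically. The following sketch uses a hypothetical regression function $x_t(\beta) = \beta_1\exp(\beta_2 z_t)$ and simulated data, neither of which comes from the text, and compares the analytic gradient $-2X^\top(\beta)(y - x(\beta))$ of the SSR function with a central-difference approximation.

```python
import numpy as np


def x_fun(beta, z):
    """Hypothetical regression function beta_1 * exp(beta_2 * z)."""
    return beta[0] * np.exp(beta[1] * z)


def X_mat(beta, z):
    """Matrix of derivatives X(beta) for the function above."""
    e = np.exp(beta[1] * z)
    return np.column_stack([e, beta[0] * z * e])


def ssr(beta, y, z):
    r = y - x_fun(beta, z)
    return float(r @ r)


def ssr_gradient(beta, y, z):
    """Analytic gradient of SSR: -2 X(beta)' (y - x(beta))."""
    return -2.0 * X_mat(beta, z).T @ (y - x_fun(beta, z))


rng = np.random.default_rng(3)
z = np.linspace(0.0, 1.0, 30)
y = x_fun(np.array([2.0, 0.5]), z) + 0.1 * rng.normal(size=z.size)

beta = np.array([1.5, 0.8])
h = 1e-6
num_grad = np.array([
    (ssr(beta + h * e, y, z) - ssr(beta - h * e, y, z)) / (2 * h)
    for e in np.eye(2)
])
print(ssr_gradient(beta, y, z), num_grad)
```

Setting this gradient to zero and dividing by $-2$ gives exactly the $k$ equations (6.27).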


Equations (6.27), which define the NLS estimator, closely resemble
equations (6.08), which define the OLS estimator. Like the latter, the former can
be interpreted as orthogonality conditions: They require that the columns of
the matrix of derivatives of $x(\beta)$ with respect to $\beta$ should be orthogonal to
the vector of residuals. There are, however, two major differences between
(6.27) and (6.08). The first difference is that, in the nonlinear case, $X(\beta)$
is a matrix of functions that depend on the explanatory variables and on $\beta$,
instead of simply a matrix of explanatory variables. The second difference is
that equations (6.27) are nonlinear in $\beta$, because both $x(\beta)$ and $X(\beta)$ are,
in general, nonlinear functions of $\beta$. Thus there is no closed-form expression
for $\hat\beta$ comparable to the famous formula (1.46). As we will see in Section 6.4,
this means that it is substantially more difficult to compute NLS estimates
than it is to compute OLS ones.


Consistency of the NLS Estimator


Since it has been assumed that every variable on which $x_t(\beta)$ depends belongs
to $\Omega_t$, it must be the case that $x_t(\beta)$ itself belongs to $\Omega_t$ for any choice of $\beta$.
Therefore, the partial derivatives of $x_t(\beta)$, that is, the elements of the row
vector $X_t(\beta)$, must belong to $\Omega_t$ as well, and so


$$E\bigl(X_t(\beta)\,u_t\bigr) = 0. \tag{6.29}$$


If we define the limiting functions $\alpha(\beta)$ for the estimator based on (6.27)
analogously to (6.12), we have


$$\alpha(\beta) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} X^\top(\beta)\bigl(y - x(\beta)\bigr).$$


It follows from (6.29) and the law of large numbers that $\alpha(\beta_0) = 0$ if the true
parameter vector is $\beta_0$. Thus the NLS estimator is consistent provided that
it is asymptotically identified. We will have more to say in the next section
about identification and the NLS estimator.


Asymptotic Normality of the NLS Estimator 


The discussion of asymptotic normality in the previous section needs to be
modified slightly for the NLS estimator. Equation (6.20), which resulted from
applying Taylor’s Theorem to $x(\hat\beta)$, is no longer true, because the matrix $W$
is replaced by $X(\beta)$, which, unlike $W$, depends on the parameter vector $\beta$.
When we take account of this fact, we obtain a rather messy additional term
in (6.20) that depends on the second derivatives of $x(\beta)$. However, it can
be shown that this extra term vanishes asymptotically. Therefore, equation
(6.21) remains true, but with $X_0 \equiv X(\beta_0)$ replacing $W$. This implies that,
for NLS, the analog of equation (6.23) is


$$n^{1/2}(\hat\beta - \beta_0) \stackrel{a}{=} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} X_0^\top X_0\Bigr)^{-1} n^{-1/2} X_0^\top u, \tag{6.30}$$


from which the asymptotic normality of the NLS estimator follows by
essentially the same arguments as before.


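Slightly modified versions of the arguments for MM estimators of the previous
section also yield expressions for the asymptotic covariance matrix of the
NLS estimator $\hat\beta$. The consistency of $\hat\beta$ means that


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} \hat X^\top \hat X = \operatorname*{plim}_{n\to\infty} \frac{1}{n} X_0^\top X_0 \quad\text{and}\quad \operatorname*{plim}_{n\to\infty} \frac{1}{n} \hat X^\top X_0 = \operatorname*{plim}_{n\to\infty} \frac{1}{n} X_0^\top X_0.$$


Thus, on setting $W = \hat X$, (6.26) gives for the asymptotic covariance matrix
of $n^{1/2}(\hat\beta - \beta_0)$ the matrix


$$\sigma_0^2 \operatorname*{plim}_{n\to\infty} \Bigl(\frac{1}{n} X_0^\top P_{\hat X} X_0\Bigr)^{-1} = \sigma_0^2 \operatorname*{plim}_{n\to\infty} \Bigl(\frac{1}{n} X_0^\top X_0\Bigr)^{-1}. \tag{6.31}$$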
It follows that a consistent estimator of the covariance matrix of $\hat\beta$, in the
sense of (5.22), is


$$\widehat{\mathrm{Var}}(\hat\beta) = s^2 (\hat X^\top \hat X)^{-1}, \tag{6.32}$$


where, by analogy with (3.49),


$$s^2 \equiv \frac{1}{n-k} \sum_{t=1}^{n} \hat u_t^2 = \frac{1}{n-k} \sum_{t=1}^{n} \bigl(y_t - x_t(\hat\beta)\bigr)^2. \tag{6.33}$$


Of course, $s^2$ is not the only consistent estimator of $\sigma^2$ that we might
reasonably use. Another possibility is to use


$$\hat\sigma^2 \equiv \frac{1}{n} \sum_{t=1}^{n} \hat u_t^2. \tag{6.34}$$


However, we will see shortly that (6.33) has particularly attractive properties.
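

Once residuals and the matrix $\hat X$ are available, the estimators in (6.32) to (6.34) take only a few lines to compute. The sketch below is an illustration under stated assumptions: the helper name `nls_covariance` is hypothetical, and the demonstration uses the linear special case, where $X(\beta) = X$ and NLS coincides with OLS, so that no nonlinear optimizer is needed.

```python
import numpy as np


def nls_covariance(resid, X_hat):
    """Return (s2, cov) with s2 = SSR/(n - k) and cov = s2 * (X_hat' X_hat)^{-1}."""
    n, k = X_hat.shape
    s2 = float(resid @ resid) / (n - k)
    cov = s2 * np.linalg.inv(X_hat.T @ X_hat)
    return s2, cov


# Linear special case with simulated data (all values hypothetical).
rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2, cov = nls_covariance(resid, X)
sigma2_hat = float(resid @ resid) / n      # divides SSR by n instead of n - k
print(s2, sigma2_hat, np.sqrt(np.diag(cov)))
```

For a genuinely nonlinear model, `X_hat` would be the matrix of derivatives evaluated at the NLS estimates and `resid` the NLS residuals; the two variance estimators differ only by the factor $(n-k)/n$.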


NLS Residuals and the Variance of the Error Terms 


Not very much can be said about the finite-sample properties of nonlinear
least squares. The techniques that we used in Chapter 3 to obtain the
finite-sample properties of the OLS estimator simply cannot be used for the NLS
one. However, it is easy to show that, if the DGP is


$$y = x(\beta_0) + u, \quad u \sim \mathrm{IID}(0, \sigma_0^2 I), \tag{6.35}$$


which means that it is a special case of the model (6.02) that is being
estimated, then


$$E\bigl(\mathrm{SSR}(\hat\beta)\bigr) \le n\sigma_0^2. \tag{6.36}$$


The argument is just this. From (6.35), $y - x(\beta_0) = u$. Therefore,


$$E\bigl(\mathrm{SSR}(\beta_0)\bigr) = E(u^\top u) = n\sigma_0^2.$$


Since $\hat\beta$ minimizes the sum of squared residuals and $\beta_0$ in general does not,
it must be the case that $\mathrm{SSR}(\hat\beta) \le \mathrm{SSR}(\beta_0)$. The inequality (6.36) follows
immediately. Thus, just like OLS residuals, NLS residuals have variance less
than the variance of the error terms.


The consistency of $\hat\beta$ implies that the NLS residuals $\hat u_t$ converge to the error
terms $u_t$ as $n \to \infty$. This means that it is valid asymptotically to use either
$s^2$ from (6.33) or $\hat\sigma^2$ from (6.34) to estimate $\sigma^2$. However, we see from (6.36)
that the NLS residuals are too small. Therefore, by analogy with the exact
results for the OLS case that were discussed in Section 3.6, it seems plausible
to divide by $n - k$ instead of by $n$ when we estimate $\sigma^2$. In fact, as we now
show, there is an even stronger justification for doing this.


If we apply Taylor’s Theorem to a typical residual, $\hat u_t = y_t - x_t(\hat\beta)$, expanding
around $\beta_0$ and substituting $u_t + x_t(\beta_0)$ for $y_t$, we obtain


$$\begin{aligned}
\hat u_t &= y_t - x_t(\beta_0) - \bar X_t(\hat\beta - \beta_0) \\
&= u_t + x_t(\beta_0) - x_t(\beta_0) - \bar X_t(\hat\beta - \beta_0) \\
&= u_t - \bar X_t(\hat\beta - \beta_0),
\end{aligned}$$


where $\bar X_t$ denotes the $t^{\rm th}$ row of $X(\bar\beta)$, for some $\bar\beta$ that satisfies (6.19). This
implies that, for the entire vector of residuals, we have


$$\hat u = u - \bar X(\hat\beta - \beta_0). \tag{6.37}$$


For the NLS estimator $\hat\beta$, the asymptotic result (6.23) becomes


$$n^{1/2}(\hat\beta - \beta_0) \stackrel{a}{=} (S_{X^\top X})^{-1}\, n^{-1/2} X_0^\top u, \tag{6.38}$$


where


$$S_{X^\top X} \equiv \operatorname*{plim}_{n\to\infty} \frac{1}{n} X_0^\top X_0. \tag{6.39}$$


We have redefined $S_{X^\top X}$ here. The old definition, (3.17), applies only to
linear regression models. The new definition, (6.39), applies to both linear
and nonlinear regression models, since it reduces to the old one when the
regression function is linear. When we substitute (6.38) into (6.37), noting
that $\bar\beta$ tends asymptotically to $\beta_0$, we find that


$$\begin{aligned}
\hat u &\stackrel{a}{=} u - n^{-1/2} X_0 (S_{X^\top X})^{-1} n^{-1/2} X_0^\top u \\
&\stackrel{a}{=} u - n^{-1} X_0 (n^{-1} X_0^\top X_0)^{-1} X_0^\top u \\
&= u - X_0 (X_0^\top X_0)^{-1} X_0^\top u \\
&= u - P_{X_0} u = M_{X_0} u,
\end{aligned} \tag{6.40}$$


where $P_{X_0}$ and $M_{X_0}$ project orthogonally on to $\mathcal{S}(X_0)$ and $\mathcal{S}^\perp(X_0)$,
respectively. This asymptotic result for NLS looks very much like the exact result
that $\hat u = M_X u$ for OLS. A more intricate argument can be used to show that
the difference between $\hat u^\top \hat u$ and $u^\top M_{X_0} u$ tends to zero as $n \to \infty$; see
Exercise 6.8. Since $X_0$ is an $n \times k$ matrix, precisely the same argument that was
used for the linear case in (3.48) shows that $E(\hat u^\top \hat u) \stackrel{a}{=} \sigma_0^2 (n - k)$. Thus we
see that, in the case of nonlinear least squares, $s^2$ provides an approximately
unbiased estimator of $\sigma_0^2$.
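

The degrees-of-freedom correction rests on the fact that the trace of $M_{X_0}$ equals $n - k$, so that $E(u^\top M_{X_0} u) = \sigma_0^2 (n - k)$ when $E(uu^\top) = \sigma_0^2 I$. A minimal numerical check, using a hypothetical simulated `X0`, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 4
X0 = rng.normal(size=(n, k))

P = X0 @ np.linalg.solve(X0.T @ X0, X0.T)   # projects on to S(X0)
M = np.eye(n) - P                            # projects on to its orthogonal complement

# E(u' M u) = sigma0^2 * trace(M) when E(u u') = sigma0^2 * I.
print(np.trace(M), n - k)
```

The trace equals $n - k$ for any $n \times k$ matrix of full column rank, which is why the same correction applies in both the linear and the nonlinear case.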


6.4 Computing NLS Estimates


We have not yet said anything about how to compute nonlinear least squares 
estimates. This is by no means a trivial undertaking. Computing NLS esti- 
mates is always much more expensive than computing OLS ones for a model 
with the same number of observations and parameters. Moreover, there is a 
risk that the program may fail to converge or may converge to values that 
do not minimize the SSR. However, with modern computers and well-written 
software, NLS estimation is usually not excessively difficult. 


In order to find NLS estimates, we need to minimize the sum-of-squared-residuals
function $\mathrm{SSR}(\beta)$ with respect to $\beta$. Since $\mathrm{SSR}(\beta)$ is not a quadratic
function of $\beta$, there is no analytic solution like the classic formula (1.46) for
the linear regression case. What we need is a general algorithm for minimizing
a sum of squares with respect to a vector of parameters. In this section, we
discuss methods for unconstrained minimization of a smooth function $Q(\beta)$.
It is easiest to think of $Q(\beta)$ as being equal to $\mathrm{SSR}(\beta)$, but much of the
discussion will be applicable to minimizing any sort of criterion function. Since
minimizing $Q(\beta)$ is equivalent to maximizing $-Q(\beta)$, it will also be
applicable to maximizing any sort of criterion function, such as the loglikelihood
functions that we will encounter in Chapter 10.


We will give an overview of how numerical minimization algorithms work, 
but we will not discuss many of the important implementation issues that can 
substantially affect the performance of these algorithms when they are incor- 
porated into computer programs. Useful references on the art and science of 
numerical optimization, especially as it applies to nonlinear regression prob- 
lems, include Bard (1974), Gill, Murray, and Wright (1981), Quandt (1983), 
Bates and Watts (1988), Seber and Wild (1989, Chapter 14), and Press et al. 
(1992a, 1992b, Chapter 10). 


There are many algorithms for minimizing a smooth function $Q(\beta)$. Most
of these operate in essentially the same way. The algorithm goes through a
series of iterations, or steps, at each of which it starts with a particular value
of $\beta$ and tries to find a better one. It first chooses a direction in which to
search and then decides how far to move in that direction. After completing
the move, it checks to see whether the current value of $\beta$ is sufficiently close to
a local minimum of $Q(\beta)$. If it is, the algorithm stops. Otherwise, it chooses
another direction in which to search, and so on. There are three principal
differences among minimization algorithms: the way in which the direction
to search is chosen, the way in which the size of the step in that direction
is determined, and the stopping rule that is employed. Numerous choices for
each of these are available.


Newton’s Method 


All of the techniques that we will discuss are based on Newton’s Method.
Suppose that we wish to minimize a function $Q(\beta)$, where $\beta$ is a $k$-vector and
$Q(\beta)$ is assumed to be twice continuously differentiable. Given any initial
value of $\beta$, say $\beta_{(0)}$, we can perform a second-order Taylor expansion of $Q(\beta)$
around $\beta_{(0)}$ in order to obtain an approximation $Q^*(\beta)$ to $Q(\beta)$:


$$Q^*(\beta) = Q(\beta_{(0)}) + g_{(0)}^\top(\beta - \beta_{(0)}) + \tfrac{1}{2}(\beta - \beta_{(0)})^\top H_{(0)} (\beta - \beta_{(0)}), \tag{6.41}$$


where $g(\beta)$, the gradient of $Q(\beta)$, is a column vector of length $k$ with
typical element $\partial Q(\beta)/\partial\beta_i$, and $H(\beta)$, the Hessian of $Q(\beta)$, is a $k \times k$ matrix
with typical element $\partial^2 Q(\beta)/\partial\beta_i\,\partial\beta_l$. For notational simplicity, $g_{(0)}$ and $H_{(0)}$
denote $g(\beta_{(0)})$ and $H(\beta_{(0)})$, respectively.


It is easy to see that the first-order conditions for a minimum of $Q^*(\beta)$ with
respect to $\beta$ can be written as


$$g_{(0)} + H_{(0)}(\beta - \beta_{(0)}) = 0.$$


Solving these yields a new value of $\beta$, which we will call $\beta_{(1)}$:


$$\beta_{(1)} = \beta_{(0)} - H_{(0)}^{-1} g_{(0)}. \tag{6.42}$$


Equation (6.42) is the heart of Newton’s Method. If the quadratic
approximation $Q^*(\beta)$ is a strictly convex function, which it will be if and only if the
Hessian $H_{(0)}$ is positive definite, $\beta_{(1)}$ will be the global minimum of $Q^*(\beta)$.
If, in addition, $Q^*(\beta)$ is a good approximation to $Q(\beta)$, $\beta_{(1)}$ should be close
to $\hat\beta$, the minimum of $Q(\beta)$. Newton’s Method involves using equation (6.42)
repeatedly to find a succession of values $\beta_{(1)}, \beta_{(2)}, \ldots$. When the original
function $Q(\beta)$ is quadratic and has a global minimum at $\hat\beta$, Newton’s Method
evidently finds $\hat\beta$ in a single step, since the quadratic approximation is then
exact. When $Q(\beta)$ is approximately quadratic, as all sum-of-squares
functions are when sufficiently close to their minima, Newton’s Method generally
converges very quickly.
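

The one-step property for quadratic functions is easy to confirm. The sketch below applies a single update of the form (6.42) to a hypothetical quadratic $Q(\beta) = \tfrac12\beta^\top A\beta - c^\top\beta$ with $A$ positive definite; the names `newton_step`, `A`, and `c` are illustrative and not taken from the text.

```python
import numpy as np


def newton_step(beta, grad, hess):
    """One Newton update: beta - H^{-1} g, as in (6.42)."""
    return beta - np.linalg.solve(hess(beta), grad(beta))


# Hypothetical quadratic Q(beta) = 0.5 * beta' A beta - c' beta, A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])


def grad(b):
    return A @ b - c


def hess(b):
    return A


beta_start = np.array([10.0, -7.0])        # arbitrary starting value
beta_1 = newton_step(beta_start, grad, hess)
beta_min = np.linalg.solve(A, c)           # exact minimizer of the quadratic
print(beta_1, beta_min)
```

Solving the linear system with `np.linalg.solve` rather than forming $H^{-1}$ explicitly is a common numerical choice; it does not change the update that (6.42) describes.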


Figure 6.1 illustrates how Newton’s Method works. It shows the contours of
the function $Q(\beta) = \mathrm{SSR}(\beta_1, \beta_2)$ for a regression model with two parameters.
Notice that these contours are not precisely elliptical, as they would be if
the function were quadratic. The algorithm starts at the point marked “0”
and then jumps to the point marked “1”. On the next step, it goes in almost
exactly the right direction, but it goes too far, moving to “2”. It then retraces
its own steps to “3”, which is essentially the minimum of $\mathrm{SSR}(\beta_1, \beta_2)$. After
one more step, which is too small to be shown in the figure, it has essentially
converged.


Figure 6.1 Newton’s Method in two dimensions


Although Newton’s Method works very well in this example, there are many
cases in which it fails to work at all, especially if $Q(\beta)$ is not convex in the
neighborhood of $\beta_{(j)}$ for some $j$ in the sequence. Some of the possibilities
are illustrated in Figure 6.2. The one-dimensional function shown there has
a global minimum at $\hat\beta$, but when Newton’s Method is started at points such
as $\beta'$ or $\beta''$, it may never find $\hat\beta$. In the former case, $Q(\beta)$ is concave at $\beta'$
instead of convex, and this causes Newton’s Method to head off in the wrong
direction. In the latter case, the quadratic approximation at $\beta''$, $Q^*(\beta)$, which
is shown by the dashed curve, is extremely poor for values away from $\beta''$,
because $Q(\beta)$ is very flat near $\beta''$. It is evident that $Q^*(\beta)$ will have a minimum
far to the left of $\hat\beta$. Thus, after the first step, the algorithm will be very much
further away from $\hat\beta$ than it was at its starting point.
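

The first kind of failure can be reproduced with a simple one-dimensional function. The sketch below uses the hypothetical function $Q(\beta) = 1 - \exp(-\beta^2)$, which is not the function plotted in Figure 6.2; it has its minimum at 0, is convex for $|\beta| < 1/\sqrt{2}$, and is concave farther out.

```python
import math


def q_prime(b):
    """First derivative of Q(b) = 1 - exp(-b^2)."""
    return 2.0 * b * math.exp(-b * b)


def q_second(b):
    """Second derivative of Q(b) = 1 - exp(-b^2)."""
    return (2.0 - 4.0 * b * b) * math.exp(-b * b)


def newton_step_1d(b):
    """One Newton update in one dimension."""
    return b - q_prime(b) / q_second(b)


good = newton_step_1d(0.3)   # start where Q is convex: moves towards 0
bad = newton_step_1d(1.0)    # start where Q is concave: moves away from 0
print(good, bad)
```

From the concave starting point the second derivative is negative, so the update moves uphill on the quadratic approximation and away from the minimum.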


One important feature of Newton’s Method and algorithms based on it is that
they must start with an initial value of $\beta$. It is impossible to perform a
Taylor expansion around $\beta_{(0)}$ without specifying $\beta_{(0)}$. As Figure 6.2
illustrates, where the algorithm starts may determine how well it performs,
or whether it converges at all. In most cases, it is up to the econometrician
to specify the starting values.
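The sensitivity to starting values can be seen in a small numerical experiment. The following Python sketch applies plain Newton steps to a one-dimensional criterion function; the function $Q(\beta) = -\exp(-\beta^2)$ and the name `newton_1d` are illustrative assumptions of ours, chosen only because this $Q$ is convex near its minimum at zero and concave further out.

```python
import math

def newton_1d(dq, d2q, beta0, steps=10):
    """Plain Newton iteration beta <- beta - Q'(beta) / Q''(beta), step length fixed at 1."""
    beta = beta0
    for _ in range(steps):
        beta = beta - dq(beta) / d2q(beta)
    return beta

# Hypothetical criterion Q(beta) = -exp(-beta^2): global minimum at 0,
# convex only for |beta| < 1/sqrt(2), concave outside that interval.
dq = lambda b: 2.0 * b * math.exp(-b * b)                 # first derivative
d2q = lambda b: (2.0 - 4.0 * b * b) * math.exp(-b * b)    # second derivative

good = newton_1d(dq, d2q, beta0=0.3)   # starts in the convex region
bad = newton_1d(dq, d2q, beta0=1.0)    # starts where Q is concave
```

Started at 0.3, the iterates settle at the minimum; started at 1.0, every step moves further away from it, which is the first kind of failure described above.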


Quasi-Newton Methods 


Most effective nonlinear optimization techniques for minimizing smooth
criterion functions are variants of Newton’s Method. These quasi-Newton
methods attempt to retain the good qualities of Newton’s Method while
surmounting problems like those illustrated in Figure 6.2. They replace
(6.42) by the slightly more complicated formula

$$\beta_{(j+1)} = \beta_{(j)} - \alpha_{(j)} D_{(j)}^{-1} g_{(j)}, \tag{6.43}$$

which determines $\beta_{(j+1)}$, the value of $\beta$ at step $j + 1$, as a function of
$\beta_{(j)}$. Here $\alpha_{(j)}$ is a scalar which is determined at each step, and
$D_{(j)} \equiv D(\beta_{(j)})$ is a matrix which approximates $H_{(j)}$ near the minimum
but is constructed so that it is always positive definite. In contrast to
quasi-Newton methods, modified Newton methods set $D_{(j)} = H_{(j)}$, and Newton’s
Method itself sets $D_{(j)} = H_{(j)}$ and $\alpha_{(j)} = 1$.


Figure 6.2 Cases for which Newton’s Method will not work

Quasi-Newton algorithms involve three operations at each step. Let us denote
the current value of $\beta$ by $\beta_{(j)}$. If $j = 0$, this is the starting value,
$\beta_{(0)}$; otherwise, it is the value reached at iteration $j$. The three
operations are


1. Compute $g_{(j)}$ and $D_{(j)}$ and use them to determine the direction
$D_{(j)}^{-1} g_{(j)}$.


2. Find $\alpha_{(j)}$. Often, this is done by solving a one-dimensional minimization
problem. Then use (6.43) to determine $\beta_{(j+1)}$.


3. Decide whether $\beta_{(j+1)}$ provides a sufficiently accurate approximation
to $\hat\beta$. If so, stop. Otherwise, return to 1.


Because they construct $D(\beta)$ in such a way that it is always positive
definite, quasi-Newton algorithms can handle problems where the function to
be minimized is not globally convex. The various algorithms choose $D(\beta)$ in
a number of ways, some of which are quite ingenious and may be tricky to
implement on a digital computer. As we will shortly see, however, for
sum-of-squares functions there is a very easy and natural way to choose
$D(\beta)$.


The scalar $\alpha_{(j)}$ is often chosen so as to minimize the function

$$Q^\dagger(\alpha) \equiv Q\bigl(\beta_{(j)} - \alpha D_{(j)}^{-1} g_{(j)}\bigr),$$

regarded as a one-dimensional function of $\alpha$. It is fairly clear that, for
the example in Figure 6.1, choosing $\alpha$ in this way would produce even faster
convergence than setting $\alpha = 1$. Some algorithms do not actually minimize
$Q^\dagger(\alpha)$ with respect to $\alpha$, but merely choose $\alpha_{(j)}$ so as to ensure that
$Q(\beta_{(j+1)})$ is less than $Q(\beta_{(j)})$. It is essential that this be the case
if we are to be sure that the algorithm will always make progress at each
step. The best algorithms, which are designed to economize on computing time,
may choose $\alpha$ quite crudely when they are far from $\hat\beta$, but they almost
always perform an accurate one-dimensional minimization when they are close
to $\hat\beta$.
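The three operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $D_{(j)}$ is built from the Hessian by replacing its eigenvalues with their absolute values (one simple way to force positive definiteness, not the choice any particular package makes), $\alpha_{(j)}$ is found by crude step halving rather than an accurate line search, and the test function $Q(\beta) = -\exp(-\beta^\top\beta)$ and the name `quasi_newton` are hypothetical.

```python
import numpy as np

def quasi_newton(q, grad, hess, beta0, tol=1e-12, max_iter=100):
    """Iterate beta <- beta - alpha * D^{-1} g, as in (6.43)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad(beta)
        # Operation 1: positive definite D from |eigenvalues| of H (assumed choice).
        vals, vecs = np.linalg.eigh(hess(beta))
        d_inv = vecs @ np.diag(1.0 / np.maximum(np.abs(vals), 1e-8)) @ vecs.T
        direction = d_inv @ g
        # Operation 3: stop when g' D^{-1} g is very small.
        if g @ direction < tol:
            break
        # Operation 2: halve alpha until the criterion function decreases.
        alpha = 1.0
        while q(beta - alpha * direction) >= q(beta) and alpha > 1e-12:
            alpha *= 0.5
        beta = beta - alpha * direction
    return beta

# Hypothetical criterion, not convex away from its minimum at the origin.
q = lambda b: -np.exp(-b @ b)
grad = lambda b: 2.0 * b * np.exp(-b @ b)
hess = lambda b: (2.0 * np.eye(b.size) - 4.0 * np.outer(b, b)) * np.exp(-b @ b)

beta_min = quasi_newton(q, grad, hess, beta0=[1.0, 0.5])
```

The starting point lies in a region where the Hessian has a negative eigenvalue, so an unmodified Newton step would move uphill; with a positive definite $D_{(j)}$ the direction is always a descent direction.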


Stopping Rules 


No minimization algorithm running on a digital computer will ever find $\hat\beta$
exactly. Without a rule telling it when to stop, the algorithm will just keep
on going forever. There are many possible stopping rules. We could, for
example, stop when $Q(\beta_{(j-1)}) - Q(\beta_{(j)})$ is very small, when every element
of $g_{(j)}$ is very small, or when every element of the vector
$\beta_{(j)} - \beta_{(j-1)}$ is very small. However, none of these rules is entirely
satisfactory, in part because they depend on the magnitude of the parameters.
This means that they will yield different results if the units of measurement
of any variable are changed or if the model is reparametrized in some other
way. A more logical rule is to stop when

$$g_{(j)}^\top D_{(j)}^{-1} g_{(j)} < \varepsilon, \tag{6.44}$$

where $\varepsilon$, the convergence tolerance, is a small positive number that is
chosen by the user. Sensible values of $\varepsilon$ might range from $10^{-12}$ to
$10^{-4}$. The advantage of (6.44) is that it weights the various components of
the gradient in a manner inversely proportional to the precision with which
the corresponding parameters are estimated. We will see why this is so in the
next section.
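The invariance of (6.44) to changes in units can be checked numerically. The sketch below uses a hypothetical quadratic criterion $Q(\beta) = \frac12(\beta - c)^\top A(\beta - c)$, with $D$ set equal to the Hessian, and rescales the parameters by $\beta = S\gamma$; the matrices `A` and `S` and the vector `c` are arbitrary illustrative values.

```python
import numpy as np

# Hypothetical quadratic criterion Q(beta) = 0.5 * (beta - c)' A (beta - c).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, -2.0])
beta = np.array([0.5, 0.5])

g = A @ (beta - c)        # gradient with respect to beta
H = A                     # Hessian with respect to beta

# Change of units: beta = S @ gamma, so derivatives with respect to gamma pick up S.
S = np.diag([100.0, 0.1])
g_scaled = S @ g          # chain rule: dQ/dgamma = S' dQ/dbeta
H_scaled = S @ H @ S

crit = g @ np.linalg.solve(H, g)                              # g' D^{-1} g in original units
crit_scaled = g_scaled @ np.linalg.solve(H_scaled, g_scaled)  # same quantity after rescaling
```

The two values of the criterion agree, whereas the length of the gradient vector changes by an order of magnitude, so a rule based on the elements of $g_{(j)}$ alone would stop at different points in the two parametrizations.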


Of course, any stopping rule may work badly if $\varepsilon$ is chosen incorrectly. If
$\varepsilon$ is too large, the algorithm may stop too soon, when $\beta_{(j)}$ is still far
away from $\hat\beta$. On the other hand, if $\varepsilon$ is too small, the algorithm may
keep going long after $\beta_{(j)}$ is so close to $\hat\beta$ that any differences are
due solely to round-off error. It may therefore be a good idea to experiment
with the value of $\varepsilon$ to see how sensitive to it the results are. If the
reported $\hat\beta$ changes noticeably when $\varepsilon$ is reduced, then either the first
value of $\varepsilon$ was too large, or the algorithm is having trouble finding an
accurate minimum.


Local and Global Minima 


Numerical optimization methods based on Newton’s Method generally work
well when $Q(\beta)$ is globally convex. For such a function, there can be at
most one local minimum, which will also be the global minimum. When $Q(\beta)$
is not globally convex but has only a single local minimum, these methods
also work reasonably well in many cases. However, if there is more than one
local minimum, optimization methods of this type often run into trouble. They
will generally converge to a local minimum, but there is no guarantee that it
will be the global one. In such cases, the choice of the starting values, that
is, the vector $\beta_{(0)}$, can be extremely important.


Figure 6.3 A criterion function with multiple minima 


This problem is illustrated in Figure 6.3. The one-dimensional criterion
function $Q(\beta)$ shown in the figure has two local minima. One of these, at
$\hat\beta$, is also the global minimum. However, if a Newton or quasi-Newton
algorithm is started to the right of the local maximum at $\beta''$, it will
probably converge to the local minimum at $\beta'$ instead of to the global one
at $\hat\beta$.


In practice, the usual way to guard against finding the wrong local minimum
when the criterion function is known, or suspected, not to be globally convex
is to minimize $Q(\beta)$ several times, starting at a number of different
starting values. Ideally, these should be quite dispersed over the
interesting regions of the parameter space. This is easy to achieve in a
one-dimensional case like the one shown in Figure 6.3. However, it is not
feasible when $\beta$ has more than a few elements: If we want to try just 10
starting values for each of $k$ parameters, the total number of starting
values will be $10^k$. Thus, in practice, the starting values will cover only
a very small fraction of the parameter space. Nevertheless, if several
different starting values all lead to the same local minimum $\hat\beta$, with
$Q(\hat\beta)$ less than the value of $Q(\beta)$ observed at any other local minimum,
then it is plausible, but by no means certain, that $\hat\beta$ is actually the
global minimum.
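A minimal sketch of the multiple-starting-value strategy, for a one-dimensional case, might look as follows. The quartic criterion function, the grid of nine starting values, and the helper `local_min` (a Newton iteration safeguarded by using the absolute value of the second derivative and by step halving) are all assumptions made for the illustration.

```python
import numpy as np

# Hypothetical criterion with two local minima (near -1.30 and near 1.13).
def q(b):
    return b**4 - 3.0 * b**2 + b

def dq(b):
    return 4.0 * b**3 - 6.0 * b + 1.0

def d2q(b):
    return 12.0 * b**2 - 6.0

def local_min(b, tol=1e-12, max_iter=200):
    """Safeguarded Newton: positive curvature proxy plus step halving."""
    for _ in range(max_iter):
        g = dq(b)
        if abs(g) < tol:
            break
        step = g / max(abs(d2q(b)), 1e-6)
        alpha = 1.0
        while q(b - alpha * step) >= q(b) and alpha > 1e-12:
            alpha *= 0.5
        b = b - alpha * step
    return b

starts = np.linspace(-2.0, 2.0, 9)             # dispersed starting values
minima = [local_min(b0) for b0 in starts]      # one local minimization per start
best = min(minima, key=q)                      # keep the lowest value of Q found
```

Some starting values lead to the minimum on the right and others to the lower one on the left; only by comparing the values of $Q$ across runs do we find that the left-hand minimum is the better of the two.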


Numerous more formal methods of dealing with multiple minima have been 
proposed. See, among others, Veall (1990), Goffe, Ferrier, and Rogers (1994), 
Dorsey and Mayer (1995), and Andrews (1997). In difficult cases, one or more 
of these methods should work better than simply using a number of starting 
values. However, they tend to be computationally expensive, and none of 
them works well in every case. 


Many of the difficulties of computing NLS estimates are related to the
identification of the model parameters by different data sets. The
identification condition for NLS is rather different from the identification
condition for the MM estimators discussed in Section 6.2. For NLS, it is
simply the requirement that the function $\mathrm{SSR}(\beta)$ should have a unique
minimum with respect to $\beta$. This is not at all the same requirement as the
condition that the moment conditions (6.27) should have a unique solution. In
the example of Figure 6.3, the moment conditions, which for NLS are
first-order conditions, are satisfied not only at the local minima $\hat\beta$ and
$\beta'$, but also at the local maximum $\beta''$. However, $\hat\beta$ is the unique
global minimum of $\mathrm{SSR}(\beta)$, and so $\beta$ is identified by the NLS estimator.


The analog for NLS of the strong asymptotic identification condition that
$S_{W^\top X}$ should be nonsingular is the condition that $S_{X^\top X}$ should be
nonsingular, since the variables $W$ of the MM estimator are replaced by $X_0$
for NLS. The strong condition for identification by a given data set is
simply that the matrix $\hat X^\top \hat X$ should be nonsingular, and therefore
positive definite. It is easy to see that this condition is just the
sufficient second-order condition for a minimum of the sum-of-squares
function at $\hat\beta$.


The Geometry of Nonlinear Regression 


For nonlinear regression models, it is not possible, in general, to draw
faithful geometrical representations of the estimation procedure in just two
or three dimensions, as we can for linear models. Nevertheless, it is often
useful to illustrate the concepts involved in nonlinear estimation
geometrically, as we do in Figure 6.4. Although the vector $x(\beta)$ lies in
$E^n$, we have supposed for the purposes of the figure that, as the scalar
parameter $\beta$ varies, $x(\beta)$ traces out a curve that we can visualize in the
plane of the page. If the model were linear, $x(\beta)$ would trace out a
straight line rather than a curve. In the same way, the dependent variable
$y$ is represented by a point in the plane of the page, or, more accurately,
by the vector in that plane joining the origin to that point.


For NLS, we seek the point on the curve generated by $x(\beta)$ that is closest
in Euclidean distance to $y$. We see from the figure that, although the
moment, or first-order, conditions are satisfied at three points, only one of
them yields the NLS estimator. Geometrically, the sum-of-squares function is
just the square of the Euclidean distance from $y$ to $x(\beta)$. Its global
minimum is achieved at $x(\hat\beta)$, not at either $x(\beta')$ or $x(\beta'')$.


Figure 6.4 NLS and MM estimation of a nonlinear model


We can also use Figure 6.4 to see how MM estimation with a fixed matrix $W$
works. Since there is just one parameter, we need a single variable $w$ that
does not depend on the model parameters, and such a variable is shown in the
figure. The moment condition defining the MM estimator is that the residuals
should be orthogonal to $w$. It can be seen that this condition is satisfied
only by the residual vector $y - x(\tilde\beta)$. In the figure, a dotted line is
drawn continuing this residual vector so as to show that it is indeed
orthogonal to $w$. There are cases, like the one in the figure, in which the
NLS first-order conditions can be satisfied for more than one value of $\beta$
while the conditions for MM estimation are satisfied for just one value, and
there are cases in which the reverse is true. Readers are invited to use
their geometrical imaginations.


6.5 The Gauss-Newton Regression 


When the function we are trying to minimize is a sum-of-squares function,
we can obtain explicit expressions for the gradient and the Hessian used in
Newton’s Method. It is convenient to write the criterion function itself as
$\mathrm{SSR}(\beta)$ divided by the sample size $n$:

$$Q(\beta) = n^{-1}\mathrm{SSR}(\beta) = \frac{1}{n}\sum_{t=1}^{n}\bigl(y_t - x_t(\beta)\bigr)^2.$$

Therefore, using the fact that the partial derivative of $x_t(\beta)$ with
respect to $\beta_i$ is $X_{ti}(\beta)$, we find that the $i$th element of the gradient
is

$$g_i(\beta) = -\frac{2}{n}\sum_{t=1}^{n} X_{ti}(\beta)\bigl(y_t - x_t(\beta)\bigr).$$

The gradient can be written more compactly in vector-matrix notation as

$$g(\beta) = -2n^{-1}X^\top(\beta)\bigl(y - x(\beta)\bigr). \tag{6.45}$$

Similarly, it can be shown that the Hessian $H(\beta)$ has typical element

$$H_{ij}(\beta) = -\frac{2}{n}\sum_{t=1}^{n}\Bigl(\bigl(y_t - x_t(\beta)\bigr)\frac{\partial X_{ti}(\beta)}{\partial\beta_j} - X_{ti}(\beta)X_{tj}(\beta)\Bigr). \tag{6.46}$$

When this expression is evaluated at $\beta_0$, it is asymptotically equivalent to

$$\frac{2}{n}\sum_{t=1}^{n} X_{ti}(\beta_0)X_{tj}(\beta_0). \tag{6.47}$$

The reason for this asymptotic equivalence is that, since
$y_t = x_t(\beta_0) + u_t$, the first term inside the large parentheses in (6.46)
becomes

$$-\frac{2}{n}\sum_{t=1}^{n}\frac{\partial X_{ti}(\beta_0)}{\partial\beta_j}\,u_t. \tag{6.48}$$

Because $x_t(\beta)$ and all its first- and second-order derivatives belong to
$\Omega_t$, the expectation of each term in (6.48) is 0. Therefore, by a law of
large numbers, expression (6.48) tends to 0 as $n \to \infty$.


Gauss-Newton Methods 


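The above results make it clear that a natural choice for $D(\beta)$ in a
quasi-Newton minimization algorithm based on (6.43) is

$$D(\beta) = 2n^{-1}X^\top(\beta)X(\beta). \tag{6.49}$$

By construction, this $D(\beta)$ is positive definite whenever $X(\beta)$ has full
rank. Substituting (6.49) and (6.45) into (6.43) yields

$$\begin{aligned}
\beta_{(j+1)} &= \beta_{(j)} + \alpha_{(j)}\bigl(2n^{-1}X_{(j)}^\top X_{(j)}\bigr)^{-1}\,2n^{-1}X_{(j)}^\top\bigl(y - x_{(j)}\bigr) \\
&= \beta_{(j)} + \alpha_{(j)}\bigl(X_{(j)}^\top X_{(j)}\bigr)^{-1}X_{(j)}^\top\bigl(y - x_{(j)}\bigr).
\end{aligned} \tag{6.50}$$

The classic Gauss-Newton method would set $\alpha_{(j)} = 1$, so that

$$\beta_{(j+1)} = \beta_{(j)} + \bigl(X_{(j)}^\top X_{(j)}\bigr)^{-1}X_{(j)}^\top\bigl(y - x_{(j)}\bigr), \tag{6.51}$$

but it is generally better to use a good one-dimensional search routine to
choose $\alpha$ optimally at each iteration. This modified type of Gauss-Newton
procedure often works quite well in practice.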


The second term on the right-hand side of (6.51) can most easily be computed 
by means of an artificial regression called the Gauss-Newton regression, or 
GNR. This artificial regression can be expressed as follows: 


$$y - x(\beta) = X(\beta)b + \text{residuals}. \tag{6.52}$$

This is the simplest version of the Gauss-Newton regression. It is called
“artificial” because the variables that appear in it are not the dependent and
explanatory variables of the nonlinear regression (6.02). Instead, they are
functions of these variables and of the model parameters. Before (6.52) can
be run as a regression, it is necessary to choose the parameter vector $\beta$ at
which the regressand and regressors are to be evaluated.


The regressand in (6.52) is the difference between the actual values of the
dependent variable and the values predicted by the regression function
$x(\beta)$ evaluated at the chosen $\beta$. There are $k$ regressors, each of which
is a vector of derivatives of $x(\beta)$ with respect to one of the elements of
$\beta$. It therefore makes sense to think of the $i$th regressor as being
associated with $\beta_i$. The vector $b$ is a vector of artificial parameters,
and we write “+ residuals” rather than the usual “+ u” to emphasize the fact
that (6.52) is not a statistical model in the usual sense.


The connection between the Gauss-Newton method of numerical optimization
and the Gauss-Newton regression should now be clear. If the variables
in (6.52) are evaluated at $\beta_{(j)}$, the OLS parameter estimates of the
artificial parameters are

$$b_{(j)} = \bigl(X_{(j)}^\top X_{(j)}\bigr)^{-1}X_{(j)}^\top\bigl(y - x_{(j)}\bigr),$$

from which it follows using (6.50) that the Gauss-Newton method gives

$$\beta_{(j+1)} = \beta_{(j)} + \alpha_{(j)}b_{(j)}.$$

Thus the GNR conveniently and cheaply performs two of the operations
necessary for a step of the Gauss-Newton method. It yields a matrix which
approximates the Hessian of $\mathrm{SSR}(\beta)$ and is always positive semidefinite.
In addition, it computes a vector of artificial parameter estimates which is
equal to $-D_{(j)}^{-1}g_{(j)}$, the direction in which the algorithm looks at
iteration $j$.
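The correspondence between Gauss-Newton steps and artificial regressions can be sketched as follows. Everything specific here is an assumption made for illustration: the regression function $x_t(\beta) = \beta_1\exp(\beta_2 z_t)$, the simulated data, the starting values, the use of step halving for $\alpha_{(j)}$, and the function name `gauss_newton`.

```python
import numpy as np

def gauss_newton(y, x_fn, X_fn, beta0, tol=1e-20, max_iter=100):
    """Minimize SSR(beta) by repeated Gauss-Newton regressions with step halving."""
    beta = np.asarray(beta0, dtype=float)
    n = y.size
    for _ in range(max_iter):
        resid = y - x_fn(beta)                          # regressand of the GNR
        X = X_fn(beta)                                  # regressors of the GNR
        b, *_ = np.linalg.lstsq(X, resid, rcond=None)   # artificial estimates b_(j)
        # With (6.45) and (6.49), g' D^{-1} g simplifies to (2/n) * resid' X b.
        if (2.0 / n) * (resid @ (X @ b)) < tol:
            break
        ssr, alpha = resid @ resid, 1.0
        while np.sum((y - x_fn(beta + alpha * b)) ** 2) >= ssr and alpha > 1e-12:
            alpha *= 0.5                                # crude choice of alpha_(j)
        beta = beta + alpha * b
    return beta

# Hypothetical model and simulated data.
rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=200)
beta_true = np.array([2.0, 0.5])
x_fn = lambda b: b[0] * np.exp(b[1] * z)
X_fn = lambda b: np.column_stack([np.exp(b[1] * z), b[0] * z * np.exp(b[1] * z)])
y = x_fn(beta_true) + 0.05 * rng.standard_normal(200)

beta_hat = gauss_newton(y, x_fn, X_fn, beta0=[1.0, 0.0])
```

Each pass through the loop is one OLS regression of the current residuals on the current matrix of derivatives; no second derivatives of the regression function are ever needed.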


One potential difficulty with the Gauss-Newton method is that the matrix
$X^\top(\beta)X(\beta)$ may sometimes be very close to singular, even though the
model is reasonably well identified by the data. If the strong identification
condition is satisfied by a given data set, then $\hat X^\top\hat X$ is positive
definite. However, when $X^\top(\beta)X(\beta)$ is evaluated far away from $\hat\beta$, it
may well be close to singular. When that happens, the algorithm gets into
trouble, because $b$ no longer lies in the same $k$-dimensional space as $\beta$,
but rather in a subspace of dimension equal to the effective rank of
$X^\top(\beta)X(\beta)$. In this event, a Gauss-Newton algorithm can cycle
indefinitely without making any progress. The best algorithms for nonlinear
least squares check whether this is happening and replace $X^\top(\beta)X(\beta)$
with another estimate of $H(\beta)$ whenever it does. See the references cited
at the beginning of Section 6.4.


Properties of the GNR 


As we have seen, when $x(\beta)$ is a linear regression model with $X$ being the
matrix of independent variables, $X(\beta)$ is simply equal to $X$. Thus, in the
case of a linear regression model, the GNR will simply be a regression of the
vector $y - X\beta$ on $X$. A special feature of the GNR for linear models is
that the classic Gauss-Newton method converges in one step from an arbitrary
starting point. To see this, let $\beta_{(0)}$ be the starting point. The GNR is

$$y - X\beta_{(0)} = Xb + \text{residuals},$$

and the artificial parameter estimates are

$$\hat b = (X^\top X)^{-1}X^\top(y - X\beta_{(0)}) = \hat\beta - \beta_{(0)},$$

where $\hat\beta$ is the OLS estimator. It follows at once that

$$\beta_{(1)} = \beta_{(0)} + \hat b = \hat\beta. \tag{6.53}$$

This property has a very useful analog for nonlinear models that we will
explore in the next section.
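The one-step property (6.53) is easy to verify numerically. In the following sketch the data are simulated and the starting point `beta_start` is deliberately absurd; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary OLS estimates

beta_start = np.array([10.0, 10.0, -10.0])           # arbitrary starting point beta_(0)
# GNR for a linear model: regress y - X beta_(0) on X.
b, *_ = np.linalg.lstsq(X, y - X @ beta_start, rcond=None)
beta_one_step = beta_start + b                       # beta_(1) = beta_(0) + b
```

Up to round-off error, `beta_one_step` coincides with `beta_ols`, whatever starting point is used.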


The properties of the GNR (6.52) depend on the choice of $\beta$. One
interesting choice is $\hat\beta$, the vector of NLS parameter estimates. With this
choice, regression (6.52) becomes

$$y - \hat x = \hat X b + \text{residuals}, \tag{6.54}$$

where $\hat x \equiv x(\hat\beta)$ and $\hat X \equiv X(\hat\beta)$. The OLS estimate of $b$
from (6.54) is

$$\hat b = (\hat X^\top\hat X)^{-1}\hat X^\top(y - \hat x). \tag{6.55}$$

Because $\hat\beta$ must satisfy the first-order conditions (6.27), the factor
$\hat X^\top(y - \hat x)$ must be a zero vector. Therefore, $\hat b = 0$, and the GNR
(6.54) will have no explanatory power whatsoever.


This may seem an uninteresting result. After all, why would anyone want to 
run an artificial regression all the coefficients of which are known in advance 
to be zero? There are in fact two very good reasons for doing so. 


The first reason is to check that the vector $\hat\beta$ reported by a program for
NLS estimation really does satisfy the first-order conditions (6.27).
Computer programs use many different techniques for calculating NLS
estimates, and many programs do not yield reliable answers in every case; see
McCullough (1999). By running the GNR (6.54), we can see whether the
first-order conditions are satisfied reasonably accurately. If all the $t$
statistics are less than about $10^{-4}$, and the $R^2$ is less than about
$10^{-8}$, then the value of $\hat\beta$ reported by the program should be
reasonably accurate. If not, there may be a problem. Possibly the estimation
should be performed again using a tighter convergence criterion, possibly we
should switch to a more accurate program, or possibly the model in question
simply cannot be estimated reliably with the data set we are using. Of
course, some programs run the GNR (6.54) and perform the requisite checks
automatically. Once we have verified that they do so, we need not bother
doing it ourselves.


Computing Covariance Matrices


The second reason to run the GNR (6.54) is to calculate an estimate of
$\mathrm{Var}(\hat\beta)$. The usual OLS covariance matrix from this regression is, by
(3.50),

$$\widehat{\mathrm{Var}}(\hat b) = s^2(\hat X^\top\hat X)^{-1}, \tag{6.56}$$

where, since the regressors have no explanatory power, $s^2$ is the same as
the one defined in (6.33). It is equal to the SSR from the original nonlinear
regression, divided by $n - k$. Evidently, the right-hand side of (6.56) is
identical to the right-hand side of (6.32), which is the standard estimator
of $\mathrm{Var}(\hat\beta)$. Thus running the GNR (6.54) provides an easy way to
calculate $\widehat{\mathrm{Var}}(\hat\beta)$.
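Both uses of the GNR (6.54) can be illustrated together. The sketch below assumes the same hypothetical exponential regression function as before, obtains $\hat\beta$ by plain Gauss-Newton steps from a good starting point, and then runs the GNR at $\hat\beta$; the data, the model, and the use of the uncentered $R^2$ are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = rng.uniform(0.0, 1.0, size=n)
x_fn = lambda b: b[0] * np.exp(b[1] * z)
X_fn = lambda b: np.column_stack([np.exp(b[1] * z), b[0] * z * np.exp(b[1] * z)])
y = x_fn(np.array([2.0, 0.5])) + 0.05 * rng.standard_normal(n)

# Obtain NLS estimates with plain Gauss-Newton steps from a good starting point.
beta_hat = np.array([2.0, 0.5])
for _ in range(50):
    step, *_ = np.linalg.lstsq(X_fn(beta_hat), y - x_fn(beta_hat), rcond=None)
    beta_hat = beta_hat + step

# GNR (6.54): regress y - x_hat on X_hat.
X_hat, u_hat = X_fn(beta_hat), y - x_fn(beta_hat)
b_hat, *_ = np.linalg.lstsq(X_hat, u_hat, rcond=None)
k = X_hat.shape[1]
gnr_resid = u_hat - X_hat @ b_hat
s2 = gnr_resid @ gnr_resid / (n - k)
cov_b = s2 * np.linalg.inv(X_hat.T @ X_hat)            # OLS covariance matrix, as in (6.56)
t_stats = b_hat / np.sqrt(np.diag(cov_b))              # should all be essentially zero
r2 = 1.0 - (gnr_resid @ gnr_resid) / (u_hat @ u_hat)   # uncentered R^2 of the GNR
```

Because the GNR has no explanatory power at $\hat\beta$, `s2` equals the SSR of the nonlinear regression divided by $n - k$, and `cov_b` can be reported directly as the estimated covariance matrix of $\hat\beta$.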


Good programs for NLS estimation will normally use (6.32) to estimate the
covariance matrix of $\hat\beta$. Not all programs can be relied upon to do this,
however, and running the GNR (6.54) is a simple way to check whether they do
so and get better estimates if they do not. Sometimes, $\hat\beta$ may be obtained
by a method other than fully nonlinear estimation. For example, the
regression function may be linear conditional on one parameter, and NLS
estimates may be obtained by searching over that parameter and performing OLS
estimation conditional on it. In such a case, it will be necessary to
calculate (6.32) explicitly, and running the GNR (6.54) is an easy way to do
so.


The GNR (6.54) can also be used to compute a heteroskedasticity-consistent
covariance matrix estimate. Any HCCME for the parameters $b$ of the GNR
will also be perfectly valid for $\hat\beta$. To see this, we start from the result
(6.38). If $E(uu^\top) = \Omega$, this result implies that

$$\mathrm{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\beta - \beta_0)\Bigr) = (S_{X^\top X})^{-1}\Bigl(\operatorname*{plim}_{n\to\infty} n^{-1}X_0^\top\Omega X_0\Bigr)(S_{X^\top X})^{-1}.$$

Therefore, from the results of Section 5.5, a reasonable way to estimate
$\mathrm{Var}(\hat\beta)$ is to use the matrix

$$\widehat{\mathrm{Var}}_h(\hat\beta) = (\hat X^\top\hat X)^{-1}\hat X^\top\hat\Omega\hat X(\hat X^\top\hat X)^{-1}, \tag{6.57}$$

where $\hat\Omega$ is an $n \times n$ diagonal matrix with the squared residual
$\hat u_t^2$ as the $t$th diagonal element. This is precisely the HCCME (5.39)
for the GNR (6.54). Of course, as in Section 5.5, $\hat\Omega$ can, and probably
should, be replaced by a modified version with better finite-sample
properties.
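The matrix (6.57) is simple to compute once $\hat X$ and the NLS residuals are available. The function below is a sketch of the unmodified version only; its name `gnr_hccme` and the small artificial inputs used to exercise it are assumptions, and none of the finite-sample modifications mentioned above is applied.

```python
import numpy as np

def gnr_hccme(X_hat, u_hat):
    """Sandwich estimator (6.57): (X'X)^{-1} X' Omega_hat X (X'X)^{-1}."""
    xtx_inv = np.linalg.inv(X_hat.T @ X_hat)
    # X' diag(u_t^2) X, computed without forming the n x n matrix Omega_hat.
    meat = X_hat.T @ (X_hat * (u_hat ** 2)[:, None])
    return xtx_inv @ meat @ xtx_inv

# Artificial inputs standing in for X(beta_hat) and the NLS residuals.
rng = np.random.default_rng(5)
X_demo = rng.standard_normal((30, 2))
u_demo = 0.5 * np.where(np.arange(30) % 2 == 0, 1.0, -1.0)   # all squared residuals equal 0.25
V_h = gnr_hccme(X_demo, u_demo)
```

In this artificial case every squared residual is the same, so the sandwich collapses to $0.25(\hat X^\top\hat X)^{-1}$, which provides a quick check that the formula has been coded correctly.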


6.6 One-Step Estimation


The result (6.53) for linear regression models has a counterpart for nonlinear 
models: If we start with estimates that are root-n consistent but inefficient, a 
single Newton, or quasi-Newton, step is all that is needed to obtain estimates 
that are asymptotically equivalent to NLS estimates. This important result 
may initially seem astonishing, but the intuition behind it is not difficult. 


Let $\acute\beta$ denote the initial root-n consistent estimates; see Section 5.4.
The GNR (6.52) evaluated at these estimates is

$$y - \acute x = \acute X b + \text{residuals},$$

where $\acute x \equiv x(\acute\beta)$ and $\acute X \equiv X(\acute\beta)$. The estimate of $b$
from this regression is

$$\acute b = (\acute X^\top\acute X)^{-1}\acute X^\top(y - \acute x). \tag{6.58}$$

Then a one-step estimator is defined by the equation

$$\grave\beta = \acute\beta + \acute b. \tag{6.59}$$


This one-step estimator turns out to be asymptotically equivalent to the NLS
estimator $\hat\beta$, by which we mean that the difference between
$n^{1/2}(\grave\beta - \beta_0)$ and $n^{1/2}(\hat\beta - \beta_0)$ tends to zero as
$n \to \infty$. In other words, after both are centered and multiplied by
$n^{1/2}$, the one-step estimator $\grave\beta$ and the NLS estimator $\hat\beta$ tend to
the same random variable asymptotically. In particular, this means that the
asymptotic covariance matrix of $\grave\beta$ is the same as that of $\hat\beta$. Thus
$\grave\beta$ shares with $\hat\beta$ the property of asymptotic efficiency. For this
reason, $\grave\beta$ is sometimes called a one-step efficient estimator.


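In order to demonstrate the asymptotic equivalence of $\grave\beta$ and $\hat\beta$, we
begin by Taylor expanding the expression $n^{-1/2}\acute X^\top(y - \acute x)$ around
$\beta = \beta_0$. This yields

$$n^{-1/2}\acute X^\top(y - \acute x) = n^{-1/2}X_0^\top(y - x_0) + A(\bar\beta)\,n^{1/2}(\acute\beta - \beta_0), \tag{6.60}$$

where $x_0 \equiv x(\beta_0)$, $\bar\beta$ is a parameter vector that satisfies (6.19), in
the sense explained just after that equation, with $\acute\beta$ in place of
$\hat\beta$, and $A(\beta)$ is the $k \times k$ matrix with typical element

$$\begin{aligned}
A_{ij}(\beta) &\equiv \frac{\partial}{\partial\beta_j}\Bigl(\frac{1}{n}\sum_{t=1}^{n} X_{ti}(\beta)\bigl(y_t - x_t(\beta)\bigr)\Bigr) \\
&= -\frac{1}{n}\sum_{t=1}^{n} X_{ti}(\beta)X_{tj}(\beta) + \frac{1}{n}\sum_{t=1}^{n}\frac{\partial X_{ti}(\beta)}{\partial\beta_j}\bigl(y_t - x_t(\beta)\bigr).
\end{aligned} \tag{6.61}$$

It can be shown that, when (6.61) is evaluated at $\bar\beta$, or at any root-n
consistent estimator of $\beta_0$, the second term tends to zero but the first
term does not.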


We have seen why this is so if we evaluate (6.61) at $\beta_0$. In that case, the
second term, like expression (6.48), becomes an average of quantities each of
which has mean zero, while the first term is an average of quantities each of
which has a nonzero mean. Essentially the same result holds when we evaluate
(6.61) at any root-n consistent estimator. Thus we conclude that

$$A(\bar\beta) \stackrel{a}{=} -n^{-1}X^\top(\bar\beta)X(\bar\beta) \stackrel{a}{=} -n^{-1}X_0^\top X_0, \tag{6.62}$$

where the second equality is also a consequence of the consistency of
$\acute\beta$.
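Using the result (6.62) in (6.60) shows that

$$n^{-1/2}\acute X^\top(y - \acute x) \stackrel{a}{=} n^{-1/2}X_0^\top u - n^{-1}X_0^\top X_0\,n^{1/2}(\acute\beta - \beta_0),$$

which can be solved to yield

$$\begin{aligned}
n^{1/2}(\acute\beta - \beta_0) &\stackrel{a}{=} (n^{-1}X_0^\top X_0)^{-1}\bigl(n^{-1/2}X_0^\top u - n^{-1/2}\acute X^\top(y - \acute x)\bigr) \\
&\stackrel{a}{=} (S_{X^\top X})^{-1}n^{-1/2}X_0^\top u - (S_{X^\top X})^{-1}n^{-1/2}\acute X^\top(y - \acute x).
\end{aligned} \tag{6.63}$$

By (6.38), the first term in the second line here is equal to
$n^{1/2}(\hat\beta - \beta_0)$. By (6.58), the second term is asymptotically
equivalent to $-n^{1/2}\acute b$. Thus (6.63) implies that

$$n^{1/2}(\acute\beta - \beta_0) \stackrel{a}{=} n^{1/2}(\hat\beta - \beta_0) - n^{1/2}\acute b.$$

Rearranging this and using the definition (6.59), we see that

$$n^{1/2}(\grave\beta - \beta_0) = n^{1/2}(\acute\beta + \acute b - \beta_0) \stackrel{a}{=} n^{1/2}(\hat\beta - \beta_0), \tag{6.64}$$

which is the result that we wished to show.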


Despite the rather complicated asymptotic theory needed to prove (6.64),
the fundamental reason that makes a one-step efficient estimator based on
the GNR asymptotically equivalent to the NLS estimator is really quite
simple. The GNR minimizes a quadratic approximation to $\mathrm{SSR}(\beta)$ around
$\acute\beta$. Asymptotically, the function $\mathrm{SSR}(\beta)$ is quadratic in the
neighborhood of $\beta_0$. If the sample size is large enough, the consistency of
$\acute\beta$ implies that we will be taking the quadratic approximation at a
point very near $\beta_0$. Therefore, the approximation will coincide with
$\mathrm{SSR}(\beta)$ itself asymptotically.


Although this result is of great theoretical interest, it is typically of limited 
practical utility with modern computing equipment. Once the GNR, or some 
other method for taking Newton or quasi-Newton steps, has been programmed 
for a particular model, we might as well let it iterate to convergence, because 
the savings in computer time from stopping after a single step are rarely 
substantial. Moreover, a one-step estimator will be consistent if and only 
if we start from an initial estimator that is consistent, while NLS will be 
consistent no matter where we start from, provided we converge to a global
minimum of $\mathrm{SSR}(\beta)$. Therefore, it may well require more effort on the part
of the investigator to obtain one-step estimates than to obtain NLS ones.


One-step estimators may be useful when the sample size is very large and each
step in the minimization process is, perhaps in consequence, very expensive. 
The large sample size will often ensure that the initial, consistent estimates 
are reasonably close to the NLS ones. If they are, the one-step estimates 
should then be very close to the latter. One-step estimators can also be 
useful when the estimation needs to be repeated many times, as will often be 
required by the bootstrap and other simulation-based methods; see Davidson 
and MacKinnon (1999a). 


The Linear Regression Model with AR(1) Errors 


An excellent example of one-step efficient estimation is provided by the model
(6.06), which is a linear regression model with AR(1) errors. The GNR that
corresponds to (6.06) is

$$y_t - \rho y_{t-1} - X_t\beta + \rho X_{t-1}\beta = (X_t - \rho X_{t-1})b + b_\rho(y_{t-1} - X_{t-1}\beta) + \text{residual}, \tag{6.65}$$

where $b$ corresponds to $\beta$ and $b_\rho$ corresponds to $\rho$. As with every GNR,
the regressand is $y_t$ minus the regression function for (6.06). The last
regressor, which is the derivative of the regression function with respect to
$\rho$, looks very much like a lagged residual from the original linear
regression model (6.05). The remaining $k$ regressors are the derivatives of
the regression function with respect to the elements of $\beta$.


It is easy to obtain root-n consistent estimates of the parameters $\rho$ and
$\beta$ of the model (6.06), because it can be written as a linear regression
subject to nonlinear restrictions on its parameters. The linear regression is

$$y_t = \rho y_{t-1} + X_t\beta + X_{t-1}\gamma + \varepsilon_t. \tag{6.66}$$

If we impose the nonlinear restrictions that $\gamma + \rho\beta = 0$, this regression
is just (6.06). Thus the model (6.06) is a special case of the model (6.66).
Therefore, if (6.06) is a correctly specified model, that is, if the true DGP
is a special case of (6.06), then (6.66) must be a correctly specified model
as well, because every DGP in (6.06) automatically belongs to (6.66). Since
(6.66) is correctly specified, the standard theory of the linear regression
with predetermined regressors applies to it, with the consequence that the
OLS estimates $\acute\rho$ and $\acute\beta$ obtained from (6.66) are root-n consistent.


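If we evaluate the variables of the GNR (6.65) at $\acute\rho$ and $\acute\beta$, we
obtain

$$y_t - \acute\rho y_{t-1} - X_t\acute\beta + \acute\rho X_{t-1}\acute\beta = (X_t - \acute\rho X_{t-1})b + b_\rho(y_{t-1} - X_{t-1}\acute\beta) + \text{residual}. \tag{6.67}$$

We can run this regression to obtain the artificial parameter estimates
$\acute b$ and $\acute b_\rho$, and the one-step efficient estimates are just
$\acute\beta + \acute b$ and $\acute\rho + \acute b_\rho$.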
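The whole procedure takes only two OLS regressions. The following sketch simulates data from a regression with AR(1) errors and then computes the one-step estimates; the sample size, the parameter values, and the decision to use two regressors with no constant term (so that $X_t$ and $X_{t-1}$ are not collinear in the unrestricted regression) are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho, beta = 2000, 0.5, np.array([1.0, 2.0])
X = rng.standard_normal((n, 2))       # two regressors, no constant (assumption of the sketch)
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):                 # AR(1) errors, started at zero for simplicity
    u[t] = rho * u[t - 1] + eps[t]
y = X @ beta + u

# Drop the first observation so that lags are available.
y_t, y_lag, X_t, X_lag = y[1:], y[:-1], X[1:], X[:-1]

# Step 1: OLS on the unrestricted linear regression (6.66).
Z = np.column_stack([y_lag, X_t, X_lag])
coef, *_ = np.linalg.lstsq(Z, y_t, rcond=None)
rho_init, beta_init = coef[0], coef[1:3]

# Step 2: the GNR (6.67), evaluated at the initial estimates.
regressand = y_t - rho_init * y_lag - (X_t - rho_init * X_lag) @ beta_init
regressors = np.column_stack([X_t - rho_init * X_lag, y_lag - X_lag @ beta_init])
b, *_ = np.linalg.lstsq(regressors, regressand, rcond=None)

# Step 3: one-step efficient estimates.
beta_one_step = beta_init + b[:2]
rho_one_step = rho_init + b[2]
```

The first two elements of `b` correspond to $\beta$ and the last to $\rho$, in the same order as the regressors of (6.67).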


6.7 Hypothesis Testing


Hypotheses about the parameters of nonlinear regression models can be
formulated in much the same way as hypotheses about the parameters of linear
regression models. Let us partition the parameter vector $\beta$ as
$\beta = [\beta_1 \,\vdots\, \beta_2]$, where $\beta_1$ is $k_1 \times 1$, $\beta_2$ is
$k_2 \times 1$, and $\beta$ is $k \times 1$, with $k = k_1 + k_2$. Then the generic
nonlinear regression model (6.02) can be written as

$$y = x(\beta_1, \beta_2) + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I).$$

If we wish to test the hypothesis that $\beta_2 = 0$, we can set up the models
that correspond to the null and alternative hypotheses as follows:

$$H_0:\; y = x(\beta_1, 0) + u; \tag{6.68}$$
$$H_1:\; y = x(\beta_1, \beta_2) + u. \tag{6.69}$$

Here, using notation introduced in Section 4.2, $H_0$ denotes the null
hypothesis, and $H_1$ denotes the alternative.


If the regression models (6.68) and (6.69) were linear, we could test the null
hypothesis by means of the $F$ statistic (4.30). In fact, we can do this even
though they are nonlinear. The test statistic

$$F_{\beta_2} \equiv \frac{(\mathrm{RSSR} - \mathrm{USSR})/r}{\mathrm{USSR}/(n - k)} \tag{6.70}$$

is computed in exactly the same way as (4.30), but with RSSR and USSR the
sums of squared residuals from NLS estimation of (6.68) and (6.69),
respectively. Here $r = k_2$, since the hypothesis that $\beta_2 = 0$ imposes $k_2$
restrictions. It is not difficult to show that (6.70) is asymptotically
valid: Under the null hypothesis, it follows the $F(r, \infty)$ distribution
asymptotically.


First, we establish some notation. Let $X(\beta)$ denote the $n \times k$ matrix
of partial derivatives of the vector of regression functions
$x(\beta) \equiv x(\beta_1, \beta_2)$ of (6.69). Similarly, let $X_1(\beta)$ and $X_2(\beta)$
denote the $n \times k_1$ and $n \times k_2$ submatrices of partial derivatives
with respect to the components of $\beta_1$ and $\beta_2$, respectively. Finally,
let $M_1$ denote the orthogonal projection on to $\mathcal{S}^\perp(X(\beta_0))$, which
we previously called $M_{X_0}$, and let $M_0$ denote the orthogonal projection
on to $\mathcal{S}^\perp(X_1(\beta_0))$. The projection $M_0$ corresponds to the null
hypothesis $H_0$, and the projection $M_1$ corresponds to the alternative
hypothesis $H_1$.


By the result (6.40), under both the null and alternative hypotheses, the
vector of residuals $\hat u$ from NLS estimation of $H_1$ is asymptotically equal
to $M_1u$. By essentially the same argument, under the null hypothesis, the
vector of residuals $\tilde u$ from NLS estimation of $H_0$ is asymptotically
equal to $M_0u$. This implies (see Exercise 6.8) that
$\hat u^\top\hat u \stackrel{a}{=} u^\top M_1u$ and $\tilde u^\top\tilde u \stackrel{a}{=} u^\top M_0u$.
Therefore, under $H_0$, $r$ times the numerator of (6.70) is asymptotically
equal to

$$u^\top M_0u - u^\top M_1u = u^\top(M_0 - M_1)u = u^\top(P_1 - P_0)u,$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 




where $P_0$ and $P_1$ are the projections complementary to $M_0$ and $M_1$. By
the result of Exercise 2.16, $P_1 - P_0$ is an orthogonal projection matrix,
which projects on to a space of dimension $k - k_1 = k_2$. Thus the numerator
of (6.70) is asymptotically $\sigma_0^2$ times a $\chi^2$ variable with $k_2$ degrees
of freedom, divided by $r = k_2$. The denominator of (6.70) is just a
consistent estimate of $\sigma_0^2$, and so, under $H_0$, (6.70) itself is
asymptotically distributed as $F(k_2, \infty) = \chi^2(k_2)/k_2$.
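As an illustration, the statistic (6.70) can be computed from two NLS estimations. In the sketch below, the regression function $x_t(\beta) = \beta_1\exp(\beta_2 z_t) + \beta_3 w_t$, the null hypothesis $\beta_3 = 0$, the simulated data (generated with $\beta_3 = 0.5$, so that the null is false), and the helper `nls` are all hypothetical.

```python
import numpy as np

def nls(y, x_fn, X_fn, beta0, max_iter=200):
    """Gauss-Newton with step halving; returns the estimates and the SSR."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        resid = y - x_fn(beta)
        ssr = resid @ resid
        b, *_ = np.linalg.lstsq(X_fn(beta), resid, rcond=None)
        alpha = 1.0
        while alpha > 1e-12 and np.sum((y - x_fn(beta + alpha * b)) ** 2) >= ssr:
            alpha *= 0.5
        if alpha <= 1e-12:            # no step reduces the SSR: treat as converged
            break
        beta = beta + alpha * b
    return beta, float(np.sum((y - x_fn(beta)) ** 2))

rng = np.random.default_rng(4)
n = 200
z, w = rng.uniform(0.0, 1.0, n), rng.standard_normal(n)
y = 2.0 * np.exp(0.5 * z) + 0.5 * w + 0.1 * rng.standard_normal(n)

# Unrestricted model, as in (6.69): x_t = b1 * exp(b2 * z_t) + b3 * w_t.
x1 = lambda b: b[0] * np.exp(b[1] * z) + b[2] * w
X1 = lambda b: np.column_stack([np.exp(b[1] * z), b[0] * z * np.exp(b[1] * z), w])
# Restricted model, as in (6.68): impose b3 = 0.
x0 = lambda b: b[0] * np.exp(b[1] * z)
X0 = lambda b: np.column_stack([np.exp(b[1] * z), b[0] * z * np.exp(b[1] * z)])

beta_r, rssr = nls(y, x0, X0, [1.0, 0.0])
beta_u, ussr = nls(y, x1, X1, np.append(beta_r, 0.0))   # start from the restricted estimates

k, r = 3, 1
f_stat = ((rssr - ussr) / r) / (ussr / (n - k))
```

Starting the unrestricted estimation from the restricted estimates guarantees that USSR cannot exceed RSSR, since no step of this algorithm is allowed to increase the sum of squared residuals. The value of `f_stat` would be compared with a critical value from the $F(r, \infty)$ distribution, or, more conservatively, from $F(r, n - k)$.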


For linear models, we saw in Section 5.4 that the $F$ statistic could be
written as (5.26), which is a special case of the more general form (5.23).
Not surprisingly, it is also possible to calculate test statistics of the
form (5.23) to test the hypothesis that $\beta_2 = 0$ in the nonlinear model
(6.69). This type of test statistic is often called a Wald statistic, because
the approach was suggested by Wald (1943). It can be written as

$$W_{\beta_2} \equiv \hat\beta_2^\top\bigl(\widehat{\mathrm{Var}}(\hat\beta_2)\bigr)^{-1}\hat\beta_2, \tag{6.71}$$

where $\hat\beta_2$ is a vector of NLS estimates from the unrestricted model (6.69),
and $\widehat{\mathrm{Var}}(\hat\beta_2)$ is the NLS estimate of its covariance matrix. This
is just a quadratic form in the vector $\hat\beta_2$ and the inverse of an estimate
of its covariance matrix. When $k_2 = 1$, the signed square root of (6.71) is
equivalent to a $t$ statistic. We will see below that the Wald statistic
(6.71) is asymptotically equivalent to the $F$ statistic (6.70), except for
the factor of $1/k_2$.


Tests Based on the Gauss-Newton Regression 


Since the GNR provides a one-step estimator asymptotically equivalent to
the NLS estimator, and it also provides the NLS estimate of the covariance
matrix of $\hat\beta_2$, a statistic asymptotically equivalent to (6.71) can be
computed by means of a GNR. This statistic will also turn out to be
asymptotically equivalent to the $F$ statistic (6.70), except for the factor
of $1/k_2$.


The Gauss-Newton regression corresponding to the model (6.69) is 


$$y - x(\beta_1, \beta_2) = X_1(\beta_1, \beta_2)\, b_1 + X_2(\beta_1, \beta_2)\, b_2 + \text{residuals}, \qquad (6.72)$$


where the vector of artificial parameters $b$ has been partitioned as $[b_1 \,\vdots\, b_2]$,
conformably with the partition of $X(\beta)$. If the GNR is to be used to test the
null hypothesis that $\beta_2 = 0$, the regressand and regressors must be evaluated
at parameter estimates which satisfy the null. We will suppose that they are
evaluated at the point $\acute\beta \equiv [\acute\beta_1 \,\vdots\, 0]$, where $\acute\beta_1$ may be any root-n consistent
estimator of $\beta_1$. Then the one-step estimator of $\beta$ can be written as


$$\acute\beta + \acute b = \begin{bmatrix} \acute\beta_1 + \acute b_1 \\ \acute b_2 \end{bmatrix}. \qquad (6.73)$$


By the results of Section 6.6, $n^{1/2}\acute b_2$ is asymptotically equivalent to $n^{1/2}\hat\beta_2$,
where $\hat\beta_2$ is the NLS estimator of $\beta_2$ from (6.69).

In practice, the two estimators that are most likely to be used for $\acute\beta_1$ are $\tilde\beta_1$,
the restricted NLS estimator, and $\hat\beta_1$, a subvector of the unrestricted NLS
estimator. Here we are once more adopting the convention, previously used
in Chapter 4, whereby a tilde denotes restricted estimates and a hat denotes
unrestricted ones. Both these estimators are root-n consistent under the null
hypothesis, but $\tilde\beta_1$ will generally be more efficient than $\hat\beta_1$. Whether we will
want to use $\tilde\beta_1$, $\hat\beta_1$, or some other root-n consistent estimator when performing
GNR-based tests will depend on how difficult the various estimators are to
compute and on the finite-sample properties of the test statistics that result
from the various choices.


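Now consider the vector of residuals $\acute u$ from OLS estimation of the GNR (6.72)
evaluated at $\acute\beta$, when the true DGP is characterized by the parameter vector
$\beta_0 \equiv [\beta_1^0 \,\vdots\, 0]$. We have

$$\begin{aligned}
\acute u &= y - x(\acute\beta_1, 0) - \acute X_1 \acute b_1 - \acute X_2 \acute b_2 \\
&= y - x(\beta_1^0, 0) - X_1(\bar\beta)(\acute\beta_1 - \beta_1^0) - \acute X_1 \acute b_1 - \acute X_2 \acute b_2 \\
&\overset{a}{=} u - \acute X_1(\acute\beta_1 + \acute b_1 - \beta_1^0) - \acute X_2 \acute b_2. \qquad (6.74)
\end{aligned}$$


Here, $\bar\beta$ is a parameter vector between $\beta_0$ and $\acute\beta$. To obtain the asymptotic
equality in the last line, we have used the fact that $X_1(\bar\beta) \overset{a}{=} \acute X_1$. The one-step
estimator (6.73) is consistent, and so the last two terms in (6.74) tend to
zero as $n \to \infty$. Thus the residuals $\acute u_t$ are asymptotically equal to the error
terms $u_t$, and so $n^{-1}\acute u^\top \acute u$ is asymptotically equal to $\sigma_0^2$, the true error variance.
In fact, because of the asymptotic equivalence of the one-step estimator $\acute\beta + \acute b$ and
the NLS estimator $\hat\beta$, (6.74) tells us that $\acute u \overset{a}{=} u - X_0(\hat\beta - \beta_0)$. An argument
like that of (6.40) then shows that $\acute u$ is asymptotically equivalent to $M_{X_0} u$.
For the moment, however, we do not need this more refined result.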


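The GNR (6.72) evaluated at $\acute\beta$ is

$$y - \acute x = \acute X_1 b_1 + \acute X_2 b_2 + \text{residuals}. \qquad (6.75)$$


Since this is a linear regression, we can apply the FWL Theorem to it. Writing
$M_{\acute X_1}$ for the projection on to $\mathcal{S}^\perp(\acute X_1)$, we see that the FWL regression can
be written as

$$M_{\acute X_1}(y - \acute x) = M_{\acute X_1} \acute X_2 b_2 + \text{residuals}.$$


This FWL regression yields the same estimates $\acute b_2$ as does (6.75). Thus,
inserting the factors of powers of $n$ that are needed for asymptotic analysis,
we find that


$$n^{1/2} \acute b_2 = \bigl(n^{-1} \acute X_2^\top M_{\acute X_1} \acute X_2\bigr)^{-1} n^{-1/2} \acute X_2^\top M_{\acute X_1} (y - \acute x). \qquad (6.76)$$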


In addition to yielding the same parameter estimates b2, the FWL regression 
has the same residuals as regression (6.75) and the same estimated covariance 
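matrix for $\acute b_2$. The latter is $\acute\sigma^2 (\acute X_2^\top M_{\acute X_1} \acute X_2)^{-1}$, where $\acute\sigma^2$ is the error variance
estimator from (6.75), which, as we just saw, is asymptotically equal to $\sigma_0^2$.
If $X_1$ and $X_2$ denote $X_1(\beta_0)$ and $X_2(\beta_0)$, respectively, we see that


$$\begin{aligned}
n^{-1} \acute X_2^\top M_{\acute X_1} \acute X_2 &= n^{-1} \acute X_2^\top \acute X_2 - n^{-1} \acute X_2^\top \acute X_1 \bigl(n^{-1} \acute X_1^\top \acute X_1\bigr)^{-1} n^{-1} \acute X_1^\top \acute X_2 \\
&\overset{a}{=} n^{-1} X_2^\top X_2 - n^{-1} X_2^\top X_1 \bigl(n^{-1} X_1^\top X_1\bigr)^{-1} n^{-1} X_1^\top X_2 \\
&= n^{-1} X_2^\top M_{X_1} X_2,
\end{aligned}$$


where the asymptotic equality follows as usual from the consistency of $\acute\beta$.
Thus $n$ times the covariance matrix estimator for $\acute b_2$ given by the GNR (6.75)
provides a consistent estimate of the asymptotic covariance matrix of the
vector $n^{1/2}(\hat\beta_2 - \beta_2)$, as given by (6.31).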


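The Wald test statistic (6.71) can be rewritten as

$$n^{1/2} \hat\beta_2^\top \bigl(n \widehat{\mathrm{Var}}(\hat\beta_2)\bigr)^{-1} n^{1/2} \hat\beta_2. \qquad (6.77)$$


This is asymptotically equivalent to the statistic


$$\frac{1}{\acute\sigma^2}\, n^{1/2} \acute b_2^\top \bigl(n^{-1} \acute X_2^\top M_{\acute X_1} \acute X_2\bigr) n^{1/2} \acute b_2, \qquad (6.78)$$


which is based entirely on quantities from the GNR (6.75). That (6.77) and
(6.78) are asymptotically equal relies on (6.76) and the fact, which we have
just shown, that the covariance matrix estimator for $\acute b_2$ is also valid for $\hat\beta_2$.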


By (6.76), the GNR-based statistic (6.78) can also be expressed as


$$\frac{1}{\acute\sigma^2}\, n^{-1/2} (y - \acute x)^\top M_{\acute X_1} \acute X_2 \bigl(n^{-1} \acute X_2^\top M_{\acute X_1} \acute X_2\bigr)^{-1} n^{-1/2} \acute X_2^\top M_{\acute X_1} (y - \acute x). \qquad (6.79)$$

When this statistic is divided by $r = k_2$, we can see by comparison with (4.33)
that it is precisely the F statistic for a test of the artificial hypothesis that
$b_2 = 0$ in the GNR (6.75). In particular, $\acute\sigma^2$ is just the sum of squared residuals
from equation (6.75), divided by $n - k$. Thus a valid test statistic can be
computed as an ordinary F statistic using the sums of squared residuals from
the “restricted” and “unrestricted” GNRs,


$$\text{GNR}_0:\quad y - \acute x = \acute X_1 b_1 + \text{residuals, and} \qquad (6.80)$$
$$\text{GNR}_1:\quad y - \acute x = \acute X_1 b_1 + \acute X_2 b_2 + \text{residuals}. \qquad (6.81)$$

In Exercise 6.9, readers are invited to show that such an F statistic is
asymptotically equivalent to the F statistic computed from the sums of squared
residuals from the two nonlinear regressions (6.68) and (6.69).
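
The F statistic based on (6.80) and (6.81) needs only two OLS regressions. The sketch below is one possible way to compute it; the names `ssr` and `gnr_f_statistic` are hypothetical, and the inputs are assumed to be the GNR regressand and the derivative matrices already evaluated at a point that satisfies the null. The illustration uses a linear model, for which the derivative matrices are simply the regressors and the GNR-based F statistic coincides with the ordinary one.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def gnr_f_statistic(resid, X1, X2):
    """F statistic from GNR_0 (6.80) and GNR_1 (6.81).

    resid  : regressand y - x(beta_acute)
    X1, X2 : derivative matrices evaluated at beta_acute
    """
    n = resid.shape[0]
    r = X2.shape[1]
    k = X1.shape[1] + r
    ssr0 = ssr(resid, X1)
    ssr1 = ssr(resid, np.column_stack([X1, X2]))
    return ((ssr0 - ssr1) / r) / (ssr1 / (n - k))

# Simulated linear model in which the null (beta_2 = 0) is true.
rng = np.random.default_rng(1)
n = 100
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 2))
y = X1 @ np.array([1.0, 0.5]) + rng.standard_normal(n)

# Arbitrary evaluation point standing in for a root-n consistent estimate.
beta1_acute = np.array([0.9, 0.6])
F_gnr = gnr_f_statistic(y - X1 @ beta1_acute, X1, X2)

# Ordinary F statistic from the restricted and unrestricted regressions of y.
X = np.column_stack([X1, X2])
F_direct = ((ssr(y, X1) - ssr(y, X)) / 2) / (ssr(y, X) / (n - 4))
```

In the linear case the two statistics agree exactly, because subtracting `X1 @ beta1_acute` from `y` does not change the residuals of either regression. For a genuinely nonlinear model the agreement is only asymptotic.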


In the quite common event that $\acute\beta_1 = \tilde\beta_1$, the first-order conditions for $\tilde\beta_1$
imply that regression (6.80) will have no explanatory power. There is no need
to run regression (6.80) in this case, because its SSR will always be identical to
the SSR from NLS estimation of the restricted model. We will see an example
of this in the next subsection.


The principal advantage of tests based on the GNR is that they can be cal- 
culated without computing two nonlinear regressions, one for each of the null 
and alternative hypotheses. The principal disadvantage is that a number of 
derivatives must be calculated, one for each parameter of the unrestricted 
model. In many cases, it is necessary to run one nonlinear regression, so as to 
obtain root-n consistent estimates of the parameters under the null. However, 
it may sometimes happen that either the null or the alternative hypothesis 
corresponds to a linear model. In such cases, no nonlinear estimation at all is 
necessary to carry out a GNR-based test. 


GNR-Based Tests for Autoregressive Errors 


An example of a model which is linear under the null hypothesis is furnished by 
the linear regression model with autoregressive errors. With time-series data, 
serial correlation of the error terms is a frequent occurrence, and so one of the 
most frequently performed tests in all of econometrics is a test in which the 
null hypothesis is a linear regression model with serially uncorrelated errors 
and the alternative is the same model with AR(1) errors. In this case, we may 
think of $H_1$ as being the model (6.06) and $H_0$ as being the model


$$y_t = X_t \beta + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2). \qquad (6.82)$$


When GNRs like (6.80) and (6.81) are used for testing, all the variables in 
them must be evaluated at a parameter vector $\acute\beta$ which satisfies the null
hypothesis. In this case, the null hypothesis corresponds to the restriction
that $\rho = 0$. Therefore, we must set $\acute\rho = 0$ in the GNRs corresponding to the
restricted model (6.82) and the unrestricted model (6.06). The natural choice
for $\acute\beta$ is then $\tilde\beta$, the vector of OLS parameter estimates for (6.82).


The GNR for (6.06) was given in (6.65). If this artificial regression is evaluated
at $\beta = \tilde\beta$ and $\rho = 0$, it becomes

$$y_t - X_t \tilde\beta = X_t b + b_\rho (y_{t-1} - X_{t-1} \tilde\beta) + \text{residual}, \qquad (6.83)$$


where $b$ corresponds to $\beta$ and $b_\rho$ corresponds to $\rho$. If we denote the OLS
residuals from (6.82) by $\tilde u_t$, the GNR (6.83) takes on the very simple form


$$\tilde u_t = X_t b + b_\rho \tilde u_{t-1} + \text{residual}. \qquad (6.84)$$


This is just a linear regression of the residuals from (6.82) on the regressors
of (6.82) and one more regressor, namely, the residuals lagged once. Since
only one restriction is to be tested, a suitable test statistic is the t statistic
for the artificial parameter $b_\rho$ in (6.84) to equal 0. This is the square root of
the F statistic, which we have seen to be asymptotically valid. 
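
The test based on (6.84) is easy to program. The following is a minimal sketch on simulated data; the function name `ar1_gnr_t_statistic`, the data-generating process, and the choice to replace the unobserved initial lagged residual by zero are assumptions made for illustration.

```python
import numpy as np

def ar1_gnr_t_statistic(y, X):
    """t statistic for b_rho = 0 in the test regression (6.84)."""
    n, k = X.shape
    beta_tilde, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta_tilde                      # OLS residuals from (6.82)
    u_lag = np.concatenate(([0.0], u[:-1]))     # lagged residuals, u_0 set to 0
    Z = np.column_stack([X, u_lag])
    coef, *_ = np.linalg.lstsq(Z, u, rcond=None)
    e = u - Z @ coef
    s2 = (e @ e) / (n - k - 1)                  # error variance estimate
    cov = s2 * np.linalg.inv(Z.T @ Z)           # usual OLS covariance matrix
    return float(coef[-1] / np.sqrt(cov[-1, -1]))

# Simulated regression with AR(1) errors, rho = 0.6 (hypothetical values).
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
eps = rng.standard_normal(n)
u = np.empty(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + eps[t]
y = X @ np.array([1.0, 2.0]) + u

t_stat = ar1_gnr_t_statistic(y, X)
```

Because the errors here really are serially correlated, the statistic should be far out in the right-hand tail of its asymptotic distribution under the null.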

Almost as simple as the above test is a test of the null hypothesis (6.82)
against an alternative in which the error terms follow the AR(2) process


$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2).$$


It is not hard to show that an appropriate artificial regression for testing (6.82)
against the AR(2) alternative that is analogous to (6.06) is


$$\tilde u_t = X_t b + b_{\rho_1} \tilde u_{t-1} + b_{\rho_2} \tilde u_{t-2} + \text{residual}; \qquad (6.85)$$


see Exercise 6.10. Since, in this case, we have a test with two degrees of
freedom, we cannot use a t test. However, it is still not necessary to run two
regressions in order to compute an F statistic. Consider the form taken by
GNR$_0$ in this case:

$$\tilde u_t = X_t b + \text{residual}. \qquad (6.86)$$


This is just the GNR corresponding to the linear regression (6.82). Since the 
regressand is the vector of residuals from estimating (6.82), it is orthogonal 
to the explanatory variables. Therefore, by (6.55), the artificial parameter 
estimates b are zero, and (6.86) has no explanatory power. As a result, the 
SSR from (6.86) is equal to the total sum of squares (TSS). But this is also 
the TSS from the GNR (6.85) corresponding to the alternative. Thus the 
difference between the SSRs from (6.86) and (6.85) is the difference between 
the TSS and the SSR from (6.85), or, more conveniently, the explained sum 
of squares (ESS) from (6.85). The GNR-based F statistic can therefore be 
computed by running (6.85) alone. In fact, since the denominator is just the 
estimate $s^2$ of the error variance from (6.85), the F statistic is simply¹


$$F = \frac{\mathrm{ESS}/r}{\mathrm{SSR}/(n-k)} = \frac{n-k}{r} \times \frac{\mathrm{ESS}}{\mathrm{SSR}}, \qquad (6.87)$$


with $r = 2$ in this particular case.


¹ We are assuming here that regression (6.85) is run over all $n$ observations. This
requires either that data for observations 0 and $-1$ are available, or that the
unobserved residuals $\tilde u_0$ and $\tilde u_{-1}$ are replaced by zeros.


Asymptotically, we can obtain a valid test statistic by using any consistent
estimate of the true error variance $\sigma^2$ as the denominator. If we were to use
the estimate under the null rather than the estimate under the alternative, the
denominator of the test statistic would be $(n-k_1)^{-1} \sum_{t=1}^n \tilde u_t^2$. Asymptotically,
it makes no difference whether we divide by $n - k_1$ or $n$ when we estimate $\sigma^2$.
Therefore, if $R^2$ is the uncentered R squared from (6.85), another perfectly
valid test statistic is


$$nR^2 = \frac{n\,\mathrm{ESS}}{\mathrm{TSS}} = \frac{\mathrm{ESS}}{n^{-1} \sum_{t=1}^n \tilde u_t^2}, \qquad (6.88)$$


which follows the $\chi^2(2)$ distribution asymptotically. If the regressors include
a constant, the residuals $\tilde u_t$ will have mean zero, and the uncentered $R^2$ (6.88)
will be identical to the centered $R^2$ that is printed by most regression packages.
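
Both forms of the statistic can be obtained from a single run of the test regression. The sketch below generalizes (6.85) to an AR(p) alternative; the function name `ar_gnr_tests` is hypothetical, unobserved initial residuals are replaced by zeros, and $k$ in (6.87) is taken to be the total number of regressors in the test regression, that is, the number of columns of `X` plus `p`.

```python
import numpy as np

def ar_gnr_tests(y, X, p):
    """Return (F, nR2) for testing against AR(p) errors via a GNR."""
    n, k1 = X.shape
    beta_tilde, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta_tilde                       # OLS residuals under the null
    # Columns are residuals lagged 1, ..., p, with pre-sample values set to 0.
    lags = np.column_stack(
        [np.concatenate((np.zeros(j), u[:-j])) for j in range(1, p + 1)]
    )
    Z = np.column_stack([X, lags])
    coef, *_ = np.linalg.lstsq(Z, u, rcond=None)
    e = u - Z @ coef
    ssr = float(e @ e)
    tss = float(u @ u)                           # uncentered total sum of squares
    ess = tss - ssr
    k = k1 + p
    F = (n - k) / p * ess / ssr                  # as in (6.87), with r = p
    nR2 = n * ess / tss                          # as in (6.88)
    return F, nR2

# Simulated data with serially uncorrelated errors (the null is true).
rng = np.random.default_rng(4)
n, p = 150, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([0.5, 1.0]) + rng.standard_normal(n)

F, nR2 = ar_gnr_tests(y, X, p)
```

The two statistics are monotonic transformations of one another, since $\mathrm{ESS}/\mathrm{SSR} = R^2/(1 - R^2)$; they differ only in the estimate of the error variance that is implicitly used.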


Whether we use the F statistic (6.87) or the nR? statistic (6.88), the GNR 
provides a very easy way to test the null hypothesis that the error terms 
are serially uncorrelated against all sorts of autoregressive alternatives. Of 
course, neither statistic will follow its asymptotic distribution exactly in finite 
samples. However, there is some evidence—for example, Kiviet (1986) — 
that the former tends to have better finite-sample properties than the latter. 
This evidence accords with theory, because, as (6.40) shows, the relationship 
between NLS residuals and error terms is approximately the same as the 
relationship between OLS residuals and error terms. Therefore, it makes 
sense to use the F form of the statistic, which treats the estimate $s^2$ based on
the GNR as if it were based on an ordinary OLS regression. 


The above example generalizes to all cases in which $\acute\beta$ is taken to be $\tilde\beta$ from
estimating the null hypothesis, whether or not the restricted model is linear.
In such cases, because GNR$_0$ has no explanatory power, its SSR is equal to
its TSS, which in turn is equal to the TSS of GNR$_1$. In consequence, we only
need to run GNR$_1$, which in this case is


$$y - \tilde x = \tilde X_1 b_1 + \tilde X_2 b_2 + \text{residuals}.$$


Under the null hypothesis, $nR^2$ from this test regression is asymptotically
distributed as $\chi^2(r)$. This is not the case for GNR$_1$ when $\acute\beta \neq \tilde\beta$. However,
the F test of (6.80) against (6.81) is asymptotically valid even when $\acute\beta \neq \tilde\beta$.
It is merely required that $\acute\beta$ should satisfy the null hypothesis and be root-n
consistent.


Most GNR-based tests are like the ones for serial correlation that we have just 
discussed, in which the GNR is evaluated at least squares estimates under the 
null hypothesis. However, it is also possible to evaluate the GNR at estimates 
obtained under the alternative hypothesis. We will encounter tests of this 
type when we discuss common factor restrictions in Chapter 7. 


Bootstrap Tests 


Because none of the tests discussed in this section is exact in finite samples, 
it is often desirable to compute bootstrap P values, which, in most cases, 
will be more accurate than ones based on asymptotic theory. The procedures 
for computing bootstrap P values for nonlinear regression models are essen- 
tially the same as the ones described in Section 4.6 for linear models. We 
use estimates under the null to generate B bootstrap samples, usually either 
generating the error terms from the $N(0, s^2)$ distribution or resampling the
rescaled residuals, and we then compute a bootstrap test statistic $\tau_j^*$ using
each of the bootstrap samples. For a test that rejects when the test statistic $\hat\tau$
is large, the bootstrap P value is then $1 - \hat F^*(\hat\tau)$, where $\hat F^*(\hat\tau)$ denotes the
EDF of the $\tau_j^*$ evaluated at $\hat\tau$. Of course, this procedure can sometimes be
computationally expensive; see Davidson and MacKinnon (1999a) for a way 
of making it somewhat less so. 
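
The last step of the procedure, turning the bootstrap statistics into a P value, can be sketched as follows. The function name `bootstrap_p_value` and the numbers are hypothetical; generating the bootstrap samples themselves depends on the model and is not shown.

```python
import numpy as np

def bootstrap_p_value(tau_hat, tau_star):
    """Bootstrap P value 1 - F*(tau_hat) for a test that rejects for large values.

    F* is the EDF of the bootstrap statistics, so the P value is the
    proportion of bootstrap statistics that exceed tau_hat.
    """
    tau_star = np.asarray(tau_star, dtype=float)
    return float(np.mean(tau_star > tau_hat))

# Hypothetical bootstrap statistics (B = 5 only to keep the example readable).
tau_star = [0.5, 1.2, 3.4, 0.1, 2.2]
p_value = bootstrap_p_value(2.0, tau_star)
```

Here two of the five bootstrap statistics exceed the actual statistic. In practice $B$ would be much larger, and is often chosen so that $\alpha(B + 1)$ is an integer for the level $\alpha$ of interest, as with the 199 bootstrap samples used in the exercises.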


6.8 Heteroskedasticity-Robust Tests 


All of the tests dealt with in the preceding section are valid only under the 
assumption that the error terms are IID. This assumption, which may be 
uncomfortably strong in some cases, can be relaxed for GNR-based tests by 
using a modified version of the GNR. 


As in Section 5.5, let us suppose that the covariance matrix of the error terms
is $\Omega$, an $n \times n$ diagonal matrix with $t$-th diagonal element $\omega_t^2$. Then the matrix
(6.57) provides a heteroskedasticity-consistent estimate of $\mathrm{Var}(\hat\beta)$, which can
be used in place of the usual estimate. The result will be a
heteroskedasticity-robust test statistic, the asymptotic distribution of which will be the same no
matter what the $\omega_t^2$ happen to be, provided the regularity conditions needed
for the HCCME to be valid are satisfied.


For $H_0$ and $H_1$ as in (6.68) and (6.69), but with heteroskedastic errors, we wish
to construct a Wald test statistic, similar to (6.71), that uses an HCCME to
estimate the covariance matrix. Let $\acute\beta$ denote a vector of parameter estimates
that is root-n consistent and satisfies the null hypothesis. Often, $\acute\beta$ will be $\tilde\beta$,
the vector of NLS estimates under the null. By arguments similar to those
that led to (6.79), an appropriate Wald statistic can be written as


$$(y - \acute x)^\top \acute M_1 \acute X_2 \bigl(\acute X_2^\top \acute M_1 \acute\Omega \acute M_1 \acute X_2\bigr)^{-1} \acute X_2^\top \acute M_1 (y - \acute x), \qquad (6.89)$$


where $\acute\Omega$ is an $n \times n$ diagonal matrix with $t$-th diagonal element equal to $\acute u_t^2$.
Here $\acute u_t$ denotes the residual $y_t - x_t(\acute\beta)$, and all quantities with an acute
accent are evaluated at $\acute\beta$. The test statistic (6.89) is a quadratic form in the
vector $\acute X_2^\top \acute M_1 (y - \acute x)$ and a matrix that estimates the inverse of its covariance
matrix. It is easy to see that, given appropriate regularity conditions, it will
be asymptotically distributed as $\chi^2(r)$ under the null hypothesis.


It is possible to compute (6.89) by means of a modified GNR. Let $\acute U$ be the
$n \times n$ diagonal matrix with $t$-th diagonal element equal to $\acute u_t$. This implies that
$\acute U \acute U = \acute U^\top \acute U = \acute\Omega$. Then, for the alternative hypothesis (6.69), consider the
artificial regression

$$\iota = P_{\acute U \acute X} \acute U^{-1} \acute X b + \text{residuals}, \qquad (6.90)$$


where, as usual, $\iota$ is an $n$-vector each component of which equals 1, and
$\acute X = [\acute X_1 \;\; \acute X_2]$. The matrix $P_{\acute U \acute X}$ is the orthogonal projection on to the
$k$-dimensional space $\mathcal{S}(\acute U \acute X)$. The matrix $\acute U^{-1}$ is a diagonal matrix with
$t$-th diagonal element equal to $\acute u_t^{-1}$. This will be undefined if $\acute u_t = 0$.
Therefore, in that event, it will be necessary to replace $\acute u_t$ by a very small, positive
number when constructing $\acute U$.

The artificial regression (6.90) is called the heteroskedasticity-robust
Gauss-Newton regression, or HRGNR. It has essentially the same properties as
the ordinary GNR, except that it is valid when there is heteroskedasticity of
unknown form. If we set $\acute\beta$ equal to $\hat\beta$, the vector of unrestricted estimates,
the regressand $\iota$ in (6.90) is seen to be orthogonal to all of the regressors. The
transpose of the regressand times the matrix of regressors is


$$\begin{aligned}
\iota^\top P_{\hat U \hat X} \hat U^{-1} \hat X &= \iota^\top \hat U \hat X \bigl(\hat X^\top \hat U \hat U \hat X\bigr)^{-1} \hat X^\top \hat U \hat U^{-1} \hat X \\
&= \hat u^\top \hat X \bigl(\hat X^\top \hat\Omega \hat X\bigr)^{-1} \hat X^\top \hat X \\
&= 0.
\end{aligned}$$


The equality in the second line uses the facts that $\hat U \iota = \hat u$, $\hat U \hat U = \hat\Omega$, and
$\hat U \hat U^{-1} = I$. The one in the third line holds because the first-order conditions
for NLS estimation are just $\hat X^\top \hat u = 0$. Thus, like the ordinary GNR, the
HRGNR can be used to verify that parameter estimates satisfy the first-order
conditions for NLS estimation.


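When it is evaluated at any $\acute\beta$ that is root-n consistent, the ordinary OLS
covariance matrix from the HRGNR is an asymptotically valid HCCME. The
TSS from regression (6.90) is $\iota^\top \iota = n$. Therefore, when $\acute\beta = \hat\beta$, the SSR will
also be $n$, and the OLS estimate of the error variance will be $n/(n - k)$, which
is asymptotically equal to 1. Even when $\acute\beta \neq \hat\beta$, the usual sort of calculation
shows that the OLS estimate of the error variance from (6.90) is
asymptotically equal to 1. Thus, except for the asymptotically negligible difference
between $s^2$ and 1, the covariance matrix estimator from (6.90) is


$$\begin{aligned}
\bigl(\acute X^\top \acute U^{-1} P_{\acute U \acute X} \acute U^{-1} \acute X\bigr)^{-1} &= \bigl(\acute X^\top \acute X (\acute X^\top \acute\Omega \acute X)^{-1} \acute X^\top \acute X\bigr)^{-1} \\
&= (\acute X^\top \acute X)^{-1} \acute X^\top \acute\Omega \acute X (\acute X^\top \acute X)^{-1},
\end{aligned}$$


which is just the HCCME (6.57) evaluated at $\acute\beta$ instead of at $\hat\beta$.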


The HRGNR (6.90) also allows one-step estimation. Although this is of no 
practical interest, since it is easier to do one-step estimation with the ordinary 
GNR, it is essential that (6.90) should allow one-step estimation for tests 
based on it to be valid. Recall that the one-step property was necessary 
for our proof that statistics based on the ordinary GNR are valid. To avoid 
tedious asymptotic arguments, we will limit ourselves to showing that (6.90) 
allows one-step estimation for linear models. Extending the argument to 
nonlinear models is not difficult, but it would involve greater complication, 
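chiefly notational. If we consider the linear regression model $y = X\beta + u$,
and evaluate the HRGNR (6.90) at an arbitrary $\acute\beta$, we have


$$\acute b = \bigl(X^\top \acute U^{-1} P_{\acute U X} \acute U^{-1} X\bigr)^{-1} X^\top \acute U^{-1} P_{\acute U X}\, \iota.$$

This can readily be seen to reduce to

$$\acute b = (X^\top X)^{-1} X^\top \acute u = (X^\top X)^{-1} X^\top (y - X \acute\beta) = \hat\beta - \acute\beta,$$


where $\hat\beta$ is the OLS estimator. It follows that the one-step estimator $\acute\beta + \acute b$ is
equal to $\hat\beta$, as we wished to show.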
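
Both properties of the HRGNR just derived for linear models can be confirmed numerically. The sketch below uses simulated heteroskedastic data and an arbitrary evaluation point `beta_acute`; all names and numbers are hypothetical. It checks that the HRGNR coefficients equal the OLS estimates minus the evaluation point, and that the inverse cross-product matrix of the HRGNR regressors equals the sandwich form of the HCCME evaluated at the same point.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
sigma_t = np.exp(0.5 * X[:, 1])                    # heteroskedastic std. deviations
y = X @ np.array([1.0, -0.5, 0.25]) + sigma_t * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimator
beta_acute = beta_hat + np.array([0.05, -0.03, 0.02])  # arbitrary evaluation point
u = y - X @ beta_acute                             # residuals at beta_acute

UX = u[:, None] * X                                # rows of X times u_t
UinvX = X / u[:, None]                             # rows of X divided by u_t
# Regressors of (6.90): projection of U^{-1} X on to the span of U X.
Z = UX @ np.linalg.lstsq(UX, UinvX, rcond=None)[0]

iota = np.ones(n)
b = np.linalg.lstsq(Z, iota, rcond=None)[0]        # HRGNR coefficient estimates

# Sandwich HCCME evaluated at beta_acute, and the HRGNR counterpart (s^2 = 1).
XtX_inv = np.linalg.inv(X.T @ X)
hccme = XtX_inv @ (X.T @ (u[:, None] ** 2 * X)) @ XtX_inv
cov_hrgnr = np.linalg.inv(Z.T @ Z)
```

Scaling rows by `u` and `1/u` avoids ever forming an $n \times n$ diagonal matrix. The sketch assumes that no residual is exactly zero, which holds with probability one for continuously distributed simulated errors.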


In order to use the HRGNR to test the null hypothesis (6.68) against the 
alternative (6.69), we need to run two versions of it and compute the difference 
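between the two SSRs, which will be asymptotically distributed as $\chi^2(k_2)$.
The two HRGNRs that we need to run are


$$\text{HRGNR}_0:\quad \iota = P_{\acute U \acute X} \acute U^{-1} \acute X_1 b_1 + \text{residuals, and} \qquad (6.91)$$
$$\text{HRGNR}_1:\quad \iota = P_{\acute U \acute X} \acute U^{-1} \acute X_1 b_1 + P_{\acute U \acute X} \acute U^{-1} \acute X_2 b_2 + \text{residuals}, \qquad (6.92)$$


where $\acute\beta$ may be any root-n consistent estimator that satisfies the restrictions
being tested. In many cases, it will be convenient to set $\acute\beta = \tilde\beta$, the restricted
estimates from (6.68). These equations are not hard to set up. The $t$-th row of
$\acute U \acute X$ is just the corresponding row of $\acute X$ multiplied by $\acute u_t$, and the $t$-th row of
$\acute U^{-1} \acute X$ is just the corresponding row of $\acute X$ divided by $\acute u_t$. It is never necessary
to construct the $n \times n$ matrix $\acute U$ at all.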
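
The two regressions (6.91) and (6.92) can be set up with row-wise scaling only. The following is a minimal sketch under stated assumptions: the names `ssr` and `hrgnr_test` are hypothetical, the inputs are residuals and derivative matrices already evaluated at estimates that satisfy the null, and exact zeros among the residuals are replaced by a small positive number as the text suggests. The example uses a linear model and compares the result with a direct evaluation of (6.89).

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def hrgnr_test(u, X1, X2, tiny=1e-10):
    """Difference between the SSRs of HRGNR_0 (6.91) and HRGNR_1 (6.92)."""
    u = np.where(u == 0.0, tiny, u)        # avoid division by zero
    X = np.column_stack([X1, X2])
    UX = u[:, None] * X                    # rows of X multiplied by u_t
    UinvX = X / u[:, None]                 # rows of X divided by u_t
    # Project U^{-1} X on to the span of U X without forming any n x n matrix.
    Z = UX @ np.linalg.lstsq(UX, UinvX, rcond=None)[0]
    iota = np.ones(u.shape[0])
    k1 = X1.shape[1]
    return ssr(iota, Z[:, :k1]) - ssr(iota, Z)

# Simulated linear model with heteroskedastic errors; the null is true.
rng = np.random.default_rng(5)
n = 120
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 2))
y = X1 @ np.array([1.0, 0.5]) + np.exp(0.4 * X1[:, 1]) * rng.standard_normal(n)

beta1_tilde = np.linalg.lstsq(X1, y, rcond=None)[0]   # restricted estimates
u_tilde = y - X1 @ beta1_tilde
stat_hrgnr = hrgnr_test(u_tilde, X1, X2)

# Direct evaluation of the quadratic form (6.89) for comparison.
M1X2 = X2 - X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]
v = M1X2.T @ u_tilde
middle = M1X2.T @ (u_tilde[:, None] ** 2 * M1X2)
stat_direct = float(v @ np.linalg.solve(middle, v))
```

The statistic would be compared with critical values of the $\chi^2(k_2)$ distribution, here with $k_2 = 2$.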


The second of the two artificial regressions, (6.92), is simply regression (6.90) 
with the matrix X explicitly partitioned. The first one, however, is not the 
HRGNR for the restricted model, because it uses the matrix $P_{\acute U \acute X}$ rather
than the matrix $P_{\acute U \acute X_1}$. In consequence, even if we set $\acute\beta = \tilde\beta$, the regressand
in (6.91) will not be orthogonal to the regressors. This is why we need to run 
two artificial regressions. We could compute an ordinary F statistic instead 
of the difference between the SSRs from (6.91) and (6.92), but there would 
be no advantage to doing so, since the F form of the test merely divides by a 
stochastic quantity that tends to 1 asymptotically. 


A different and more limited form of the HRGNR, which is applicable only to 
hypothesis testing, was first proposed by Davidson and MacKinnon (1985a); 
see Exercise 6.21. It was later rediscovered by Wooldridge (1990, 1991) and 
extended to handle other cases, including regression models with error terms 
that have autocorrelation as well as heteroskedasticity of unknown form. 


6.9 Final Remarks 


In this chapter, we have dealt only with the estimation of nonlinear regression 
models by the method of moments and by nonlinear least squares. However, 
many of the results will reappear, in slightly different forms, when we con- 
sider estimation methods for other sorts of models. The NLS estimator is an 
extremum estimator, that is, an estimator obtained by minimizing or maxi- 
mizing a criterion function. In the next few chapters, we will encounter several
other extremum estimators: generalized least squares (Chapter 7), generalized
instrumental variables (Chapter 8), the generalized method of moments
(Chapter 9), and maximum likelihood (Chapter 10). Most of these estima- 
tors, like the NLS estimator, can be derived from the method of moments. All 
extremum estimators share a number of common features. Similar asymptotic 
results, and similar methods of proof, apply to all of them. 


6.10 Exercises 


6.1 Let the expectation of a random variable $Y$ conditional on a set of other
random variables $X_1, \ldots, X_k$ be the deterministic function $h(X_1, \ldots, X_k)$ of the
conditioning variables. Let $\Omega$ be the information set consisting of all
deterministic functions of the $X_i$, $i = 1, \ldots, k$. Show that $E(Y \mid \Omega) = h(X_1, \ldots, X_k)$.
Hint: Use the Law of Iterated Expectations for $\Omega$ and the information set
defined by the $X_i$.


6.2 Consider a model similar to (3.20), but with error terms that are normally
distributed:

$$y_t = \beta_1 + \beta_2 (1/t) + u_t, \qquad u_t \sim \mathrm{NID}(0, \sigma^2),$$

where $t = 1, 2, \ldots, n$. If the true value of $\beta_2$ is $\beta_2^0$ and $\hat\beta_2$ is the OLS estimator,
show that the limit in probability of $\hat\beta_2 - \beta_2^0$ is a normal random variable with
mean 0 and variance $6\sigma^2/\pi^2$. In order to obtain this result, you will need to
use the results that $\sum_{t=1}^{\infty} (1/t)^2 = \pi^2/6$, and that, if $s(n) = \sum_{t=1}^{n} (1/t)$, then
$\lim_{n\to\infty} n^{-1} s(n) = 0$ and $\lim_{n\to\infty} n^{-1} s^2(n) = 0$.


6.3 Show that the MM estimator defined by (6.10) depends on $W$ only through
the span $\mathcal{S}(W)$ of its columns. This is equivalent to showing that the estimator
depends on $W$ only through the orthogonal projection matrix $P_W$.


6.4 Show algebraically that the first-order conditions for minimizing the SSR
function (6.28) have the same solutions as the moment conditions (6.27).


6.5 Apply Taylor’s Theorem to $n^{-1}$ times the left-hand side of the moment
conditions (6.27), expanding around the true parameter vector $\beta_0$. Show that
the extra term which appears here, but was absent in (6.20), tends to zero as
$n \to \infty$. Make clear where and how you use a law of large numbers in your
demonstration.


6.6 For the nonlinear regression model

$$y_t = \beta_1 z_t^{\beta_2} + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2),$$

write down the sum of squared residuals as a function of $\beta_1$, $\beta_2$, $y_t$, and $z_t$.
Then differentiate it to obtain two first-order conditions. Show that these
equations are equivalent to special cases of the moment conditions (6.27).


6.7 In each of the following regressions, $y_t$ is the dependent variable, $x_t$ and $z_t$
are explanatory variables, and $\alpha$, $\beta$, and $\gamma$ are unknown parameters.


a) yt = a + brt + Y/2t + ut 
b) yt =a + brr +tt/Y +u 
c) yt =a + partt zt/y+ ut 
d) yt =a + baxt + zt/B + ut

e) Yt =a + Baez, + ut 

f) Yt = a+ By xe2¢t + Yzt + Ut 

g) Yt = a+ Pyare t+ y2e + ut 

h) yt = a + Gay + Ba? + ut 

i) yt = a + Bxt +y? + ut 

j) ye =a + pyr? +u 

k) yt = a + Baxt + (1 -— B) + ut 

1) yt = a + baxt + (y — p) +u 
For each of these regressions, is it possible to obtain a least-squares estimator 
of the parameters? In other words, is each of these models identified? If not, 
explain why not. If so, can the estimator be obtained by ordinary (that is, 
linear) least squares? If it can, write down the regressand and regressors for 
the linear regression to be used. 


6.8 Show that a Taylor expansion to second order of an NLS residual gives

$$\hat u_t = u_t - X_t(\beta_0)(\hat\beta - \beta_0) - \tfrac{1}{2} (\hat\beta - \beta_0)^\top \bar H_t (\hat\beta - \beta_0), \qquad (6.93)$$

where $\beta_0$ is the parameter vector of the DGP, and the $k \times k$ matrix $\bar H_t \equiv H_t(\bar\beta)$
is the matrix of second derivatives with respect to $\beta$ of the regression
function $x_t(\beta)$, evaluated at some $\bar\beta$ that satisfies (6.19).

Define $b \equiv n^{1/2} (\hat\beta - \beta_0)$, so that, as $n \to \infty$, $b$ tends to the normal random
variable $\mathrm{plim}_{n\to\infty} (n^{-1} X_0^\top X_0)^{-1} n^{-1/2} X_0^\top u$. By expressing equation (6.93)
in terms of $b$, show that the difference between $\hat u^\top \hat u$ and $u^\top M_{X_0} u$ tends to 0
as $n \to \infty$. Here $M_{X_0} = I - P_{X_0}$, where $P_{X_0} = X_0 (X_0^\top X_0)^{-1} X_0^\top$ is the
orthogonal projection on to $\mathcal{S}(X_0)$.


6.9 Using the result (6.40) on NLS residuals, show that the F statistic computed
using the sums of squared residuals from the two GNRs (6.80) and (6.81)
is asymptotically equivalent to the F statistic computed using the sums of
squared residuals from the nonlinear regressions (6.68) and (6.69).


6.10 Consider a linear regression with AR(2) errors. This can be written as

$$y_t = X_t \beta + u_t, \qquad u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2).$$

Show how to test the null hypothesis that $\rho_1 = \rho_2 = 0$ by means of a GNR.


6.11 Consider again the ADL model (3.70) of Exercise 3.22, which is reproduced
here with a minor notational change:

$$c_t = \alpha + \beta c_{t-1} + \gamma_0 y_t + \gamma_1 y_{t-1} + \varepsilon_t. \qquad (6.94)$$

Recall that $c_t$ and $y_t$ are the logarithms of consumption and income,
respectively. Show that this model contains as a special case the following linear
model with AR(1) errors:

$$c_t = \delta_0 + \delta_1 y_t + u_t, \quad \text{with} \quad u_t = \rho u_{t-1} + \varepsilon_t, \qquad (6.95)$$

where $\varepsilon_t$ is IID. Write down the relation between the parameters $\delta_0$, $\delta_1$,
and $\rho$ of this model and the parameters $\alpha$, $\beta$, $\gamma_0$, and $\gamma_1$ of (6.94). How
many and what restrictions are imposed on the latter set of parameters by
the model (6.95)?


6.12 Using the data in the file consumption.data, estimate the nonlinear model
defined implicitly by (6.95) for the period 1953:1 to 1996:4 by nonlinear least
squares. Since pre-sample data are available, you should use all 176
observations for the estimation. Do not use a specialized procedure for AR(1)
estimation. For starting values, use the estimates of $\delta_0$, $\delta_1$, and $\rho$ implied by
the OLS estimates of equation (6.94). Finding them requires the solution to
the previous exercise.

Repeat this exercise, using 0 as the starting value for all three parameters.
Does the algorithm converge as rapidly as it did before? Do you obtain the
same estimates? If not, which ones are actually the NLS estimates?

Test the restrictions that the nonlinear model imposes on the model (6.94)
by means of an asymptotic F test.


6.13 Using the estimates of the model (6.95) from the previous question, generate a
single set of simulated data $c_t^*$ for the period 1953:1 to 1996:4. The simulation
should be conditional on the pre-sample value (that is, the value for 1952:4) of
log consumption. Do this in two different ways. First, generate error terms $u_t^*$
that follow an AR(1) process, and then generate the $c_t^*$ in terms of these $u_t^*$.
Next, perform the simulation directly in terms of the innovations $\varepsilon_t^*$, using the
nonlinear model obtained by imposing the appropriate restrictions on (6.94).
Show that, if you use the same realizations for the $\varepsilon_t^*$, the simulated values
$c_t^*$ are identical. Estimate the model (6.95) using your simulated data.


6.14 The nonlinear model obtained from (6.95) has just three parameters: $\delta_0$, $\delta_1$,
and $\rho$. It can therefore be estimated by the method of moments using three
exogenous or predetermined variables. Estimate the model using the constant
and the three possible choices of two variables from the set of nonconstant
explanatory variables in (6.94).


6.15 Formulate a GNR, based on estimates under the null hypothesis, that allows
you to use a t test to test the restriction imposed on the model (6.94) by the
model (6.95). Compare the P value for this (asymptotic) t test with the one
for the F test of Exercise 6.12.


6.16 Starting from the unconstrained estimates provided by (6.94), obtain
one-step efficient estimates of the parameters of (6.95) using the GNR associated
with that model. Use the GNR iteratively so as to approach the true NLS
estimates more closely, until such time as the sum of squared residuals from
the GNR is within $10^{-8}$ of the one obtained by NLS estimation. Compare
the number of iterations of this GNR-based procedure with the number used
by the NLS algorithm of your software package.


6.17 Formulate a GNR, based on estimates under the alternative hypothesis, to
test the restriction imposed on the model (6.94) by the model (6.95). Your
test procedure should just require two OLS regressions.


6.18 Using 199 bootstrap samples, compute a parametric bootstrap P value for
the test statistic obtained in Exercise 6.17. Assume that the error terms are
normally distributed.


6.19 Test the hypothesis that $\gamma_0 + \gamma_1 = 0$ in (6.94). Do this in three different ways,
two of which are valid in the presence of heteroskedasticity of unknown form.


6.20 For the nonlinear regression model defined implicitly by (6.95) and estimated
using the data in the file consumption.data, perform three different tests of the
hypothesis that all the coefficients are the same for the two subsamples 1953:1
to 1970:4 and 1971:1 to 1996:4. Firstly, use an asymptotic F test based on
nonlinear estimation of both the restricted and unrestricted models. Secondly,
use an asymptotic F test based on a GNR which requires nonlinear estimation
only under the null. Finally, use a test that is robust to heteroskedasticity of
unknown form. Hint: See regressions (6.91) and (6.92).


6.21 The original HRGNR proposed by Davidson and MacKinnon (1985a) is

$$\iota = \acute U \acute M_1 \acute X_2 b_2 + \text{residuals}, \qquad (6.96)$$

where $\acute U$, $\acute X_1$, and $\acute X_2$ are as defined in Section 6.8, $b_2$ is a $k_2$-vector, and $\acute M_1$
is the matrix that projects orthogonally on to $\mathcal{S}^\perp(\acute X_1)$. The test statistic for
the null hypothesis that $\beta_2 = 0$ is $n$ minus the SSR from regression (6.96).

Use regression (6.96), where all the matrices are evaluated at restricted NLS
estimates, to retest the hypothesis of the previous question. Comment on the
relationship between the test statistic you obtain and the
heteroskedasticity-robust test statistic of the previous question that was based on regressions
(6.91) and (6.92).


6.22 Suppose that $P$ is a projection matrix with rank $r$. Without loss of generality,
we can assume that $P$ projects on to the span of the columns of an $n \times r$ matrix
$Z$. Suppose further that the $n$-vector $z$ is distributed as $\mathrm{IID}(0, I)$. Show
that the quadratic form $z^\top P z$ follows the $\chi^2(r)$ distribution asymptotically
as $n \to \infty$. (Hint: See the proof of Theorem 4.1.)


7.1 Introduction


If the parameters of a regression model are to be estimated efficiently by least 
squares, the error terms must be uncorrelated and have the same variance. 
These assumptions are needed to prove the Gauss-Markov Theorem and to 
show that the nonlinear least squares estimator is asymptotically efficient; see 
Sections 3.5 and 6.3. Moreover, the usual estimators of the covariance matrices 
of the OLS and NLS estimators are not valid when these assumptions do not 
hold, although alternative “sandwich” covariance matrix estimators that are 
asymptotically valid may be available (see Sections 5.5, 6.5, and 6.8). Thus 
it is clear that we need new estimation methods to handle regression models 
with error terms that are heteroskedastic, serially correlated, or both. We 
develop some of these methods in this chapter. 


Since heteroskedasticity and serial correlation affect both linear and nonlinear 
regression models in the same way, there is no harm in limiting our attention 
to the simpler, linear case. We will be concerned with the model 


$$y = X\beta + u, \qquad E(uu^\top) = \Omega, \tag{7.01}$$


where $\Omega$, the covariance matrix of the error terms, is a positive definite $n \times n$
matrix. If $\Omega$ is equal to $\sigma^2 I$, then (7.01) is just the linear regression model
(3.03), with error terms that are uncorrelated and homoskedastic. If $\Omega$ is
diagonal with nonconstant diagonal elements, then the error terms are still
uncorrelated, but they are heteroskedastic. If $\Omega$ is not diagonal, then $u_i$
and $u_j$ are correlated whenever $\Omega_{ij}$, the $ij$th element of $\Omega$, is nonzero. In
econometrics, covariance matrices that are not diagonal are most commonly 
encountered with time-series data, and the correlations are usually highest for 
observations that are close in time. 


In the next section, we obtain an efficient estimator for the vector $\beta$ in the
model (7.01) by transforming the regression so that it satisfies the conditions of
the Gauss-Markov theorem. This efficient estimator is called the generalized
least squares, or GLS, estimator. Although it is easy to write down the GLS
estimator, it is not always easy to compute it. In Section 7.3, we therefore
discuss ways of computing GLS estimates, including the particularly simple 
case of weighted least squares. In Section 7.4, we relax the often
implausible assumption that the matrix $\Omega$ is completely known. Section 7.5
discusses some aspects of heteroskedasticity. Sections 7.6 through 7.9 deal 
with various aspects of serial correlation, including autoregressive and moving 
average processes, testing for serial correlation, GLS and NLS estimation of 
models with serially correlated errors, and specification tests for models with 
serially correlated errors. Finally, Section 7.10 discusses error-components 
models for panel data. 


7.2 The GLS Estimator 


In order to obtain an efficient estimator of the parameter vector $\beta$ of the
linear regression model (7.01), we transform the model so that the transformed
model satisfies the conditions of the Gauss-Markov theorem. Estimating the
transformed model by OLS therefore yields efficient estimates. The
transformation is expressed in terms of an $n \times n$ matrix $\Psi$, which is usually
triangular, that satisfies the equation

$$\Omega^{-1} = \Psi\Psi^\top. \tag{7.02}$$


As we discussed in Section 3.4, such a matrix can always be found, often by
using Crout's algorithm. Premultiplying (7.01) by $\Psi^\top$ gives


$$\Psi^\top y = \Psi^\top X\beta + \Psi^\top u. \tag{7.03}$$


Because the covariance matrix $\Omega$ is nonsingular, the matrix $\Psi$ must be as
well, and so the transformed regression model (7.03) is perfectly equivalent to
the original model (7.01). The OLS estimator of $\beta$ from regression (7.03) is


$$\hat\beta_{\rm GLS} = (X^\top\Psi\Psi^\top X)^{-1}X^\top\Psi\Psi^\top y = (X^\top\Omega^{-1}X)^{-1}X^\top\Omega^{-1}y. \tag{7.04}$$


This estimator is called the generalized least squares, or GLS, estimator of $\beta$.


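It is not difficult to show that the covariance matrix of the transformed error
vector $\Psi^\top u$ is simply the identity matrix:


$$\begin{aligned}
E(\Psi^\top uu^\top\Psi) &= \Psi^\top E(uu^\top)\Psi = \Psi^\top\Omega\Psi \\
&= \Psi^\top(\Psi\Psi^\top)^{-1}\Psi = \Psi^\top(\Psi^\top)^{-1}\Psi^{-1}\Psi = I.
\end{aligned}$$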


The second equality in the second line here uses a result about the inverse of 
a product of square matrices that was proved in Exercise 1.15. 


Since $\hat\beta_{\rm GLS}$ is just the OLS estimator from (7.03), its covariance matrix can
be found directly from the standard formula for the OLS covariance matrix,
expression (3.28), if we replace $X$ by $\Psi^\top X$ and $\sigma_0^2$ by 1:


$$\operatorname{Var}(\hat\beta_{\rm GLS}) = (X^\top\Psi\Psi^\top X)^{-1} = (X^\top\Omega^{-1}X)^{-1}. \tag{7.05}$$


In order for (7.05) to be valid, the conditions of the Gauss-Markov theorem
must be satisfied. Here, this means that $\Omega$ must be the covariance matrix
of $u$ conditional on the explanatory variables $X$. It is thus permissible for $\Omega$
to depend on $X$, or indeed on any other exogenous variables.
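
The steps from (7.02) to (7.05) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small simulated data set. The function name gls_via_transformation, the simulated Omega, and the parameter values are illustrative choices that do not come from the text, and the Cholesky routine is used here in place of Crout's algorithm to obtain a triangular $\Psi$.

```python
import numpy as np

def gls_via_transformation(y, X, Omega):
    """GLS computed as OLS on the transformed regression (7.03).

    Psi is taken to be the lower-triangular Cholesky factor of the
    inverse of Omega, so that inv(Omega) = Psi @ Psi.T, as in (7.02).
    """
    Psi = np.linalg.cholesky(np.linalg.inv(Omega))
    X_star = Psi.T @ X                      # transformed regressors
    y_star = Psi.T @ y                      # transformed regressand
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    cov = np.linalg.inv(X_star.T @ X_star)  # covariance matrix (7.05)
    return beta, cov, Psi

# Simulated data; every number below is an arbitrary illustrative choice.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
A = rng.normal(size=(n, n))
Omega = A @ A.T / n + np.eye(n)             # positive definite by construction
u = np.linalg.cholesky(Omega) @ rng.normal(size=n)   # so that E(uu') = Omega
y = X @ np.array([1.0, 2.0, -0.5]) + u

beta_gls, cov_gls, Psi = gls_via_transformation(y, X, Omega)

# Direct evaluation of the second expression in (7.04), for comparison.
Omega_inv = np.linalg.inv(Omega)
beta_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

Up to rounding error, beta_gls and beta_direct should coincide, and Psi.T @ Omega @ Psi should be the identity matrix. Inverting the full matrix Omega is acceptable only because n is small here.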


The generalized least squares estimator $\hat\beta_{\rm GLS}$ can also be obtained by
minimizing the GLS criterion function


$$(y - X\beta)^\top\Omega^{-1}(y - X\beta), \tag{7.06}$$


which is just the sum of squared residuals from the transformed
regression (7.03). This criterion function can be thought of as a generalization
of the SSR function in which the squares and cross products of the residuals
from the original regression (7.01) are weighted by the inverse of the matrix $\Omega$.
The effect of such a weighting scheme is clearest when $\Omega$ is a diagonal matrix:
In that case, each observation is simply given a weight proportional to the
inverse of the variance of its error term.


Efficiency of the GLS Estimator 


The GLS estimator $\hat\beta_{\rm GLS}$ defined in (7.04) is also the solution of the set of
moment conditions


$$X^\top\Omega^{-1}(y - X\hat\beta_{\rm GLS}) = 0. \tag{7.07}$$


These moment conditions are equivalent to the first-order conditions for the 
minimization of the GLS criterion function (7.06). 
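
As a small numerical illustration of the link between the criterion function (7.06) and the moment conditions (7.07), the sketch below uses a diagonal Omega with made-up variances. The function name gls_criterion and all data are assumptions introduced only for this example.

```python
import numpy as np

def gls_criterion(beta, y, X, Omega_inv):
    """GLS criterion function (7.06)."""
    r = y - X @ beta
    return r @ Omega_inv @ r

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
omega2 = np.exp(rng.normal(size=n))      # assumed error variances (diagonal Omega)
Omega_inv = np.diag(1.0 / omega2)
y = X @ np.array([0.5, 1.5]) + np.sqrt(omega2) * rng.normal(size=n)

beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Left-hand side of the moment conditions (7.07); numerically zero.
moments = X.T @ Omega_inv @ (y - X @ beta_gls)

# With diagonal Omega, (7.06) is a sum of squares weighted by 1/omega_t^2.
q_min = gls_criterion(beta_gls, y, X, Omega_inv)
q_weighted = np.sum((y - X @ beta_gls) ** 2 / omega2)

# The criterion at beta_gls is no larger than at randomly perturbed points.
q_near = [gls_criterion(beta_gls + d, y, X, Omega_inv)
          for d in 0.1 * rng.normal(size=(20, 2))]
```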


Since the GLS estimator is a method of moments estimator, it is interesting to
compare it with other MM estimators. A general MM estimator for the linear
regression model (7.01) is defined in terms of an $n \times k$ matrix of exogenous
variables $W$, where $k$ is the dimension of $\beta$, by the equations


$$W^\top(y - X\beta) = 0. \tag{7.08}$$


These equations are a special case of the moment conditions (6.10) for the 
nonlinear regression model. Since there are k equations and k unknowns, we 
can solve (7.08) to obtain the MM estimator 


$$\hat\beta_W = (W^\top X)^{-1}W^\top y. \tag{7.09}$$


The GLS estimator (7.04) is evidently a special case of this MM estimator,
with $W = \Omega^{-1}X$.

Under certain assumptions, the MM estimator (7.09) is unbiased for the model
(7.01). Suppose that the DGP is a special case of that model, with parameter
vector $\beta_0$ and known covariance matrix $\Omega$. We assume that $X$ and $W$ are
exogenous, which implies that $E(u \mid X, W) = 0$. This rather strong assumption,
which is analogous to the assumption (3.08), is necessary for the unbiasedness
of $\hat\beta_W$ and makes it unnecessary to resort to asymptotic analysis. If we merely
wanted to prove that $\hat\beta_W$ is consistent, we could, as in Section 6.2, get away
with the much weaker assumption that $E(u_t \mid W_t) = 0$.


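Substituting $X\beta_0 + u$ for $y$ in (7.09), we see that

$$\hat\beta_W = \beta_0 + (W^\top X)^{-1}W^\top u.$$

Therefore, the covariance matrix of $\hat\beta_W$ is


$$\begin{aligned}
\operatorname{Var}(\hat\beta_W) &= E\bigl((\hat\beta_W - \beta_0)(\hat\beta_W - \beta_0)^\top\bigr) \\
&= E\bigl((W^\top X)^{-1}W^\top uu^\top W(X^\top W)^{-1}\bigr) \\
&= (W^\top X)^{-1}W^\top\Omega W(X^\top W)^{-1}.
\end{aligned} \tag{7.10}$$


As we would expect, this is a sandwich covariance matrix. When $W = X$,
we have the OLS estimator, and $\operatorname{Var}(\hat\beta_W)$ reduces to expression (5.32).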
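
The MM estimator (7.09) and its sandwich covariance matrix (7.10) translate directly into code. The sketch below is illustrative only: the function names mm_estimator and mm_covariance are hypothetical, and the matrix Omega, with typical element $0.6^{|i-j|}$, is chosen simply because it is easy to construct and positive definite.

```python
import numpy as np

def mm_estimator(y, X, W):
    """MM estimator (7.09), the solution of W'(y - X beta) = 0."""
    return np.linalg.solve(W.T @ X, W.T @ y)

def mm_covariance(X, W, Omega):
    """Sandwich covariance matrix (7.10) of the MM estimator."""
    WX_inv = np.linalg.inv(W.T @ X)
    # inv(X'W) is the transpose of inv(W'X).
    return WX_inv @ W.T @ Omega @ W @ WX_inv.T

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
t = np.arange(n)
Omega = 0.6 ** np.abs(t[:, None] - t[None, :])   # illustrative, positive definite
Omega_inv = np.linalg.inv(Omega)
y = X @ np.array([1.0, 0.5]) + np.linalg.cholesky(Omega) @ rng.normal(size=n)

W_gls = Omega_inv @ X
beta_gls = mm_estimator(y, X, W_gls)             # W = inv(Omega) X gives GLS
cov_gls = mm_covariance(X, W_gls, Omega)         # collapses to (7.05)
cov_ols = mm_covariance(X, X, Omega)             # W = X gives the OLS sandwich
```

With $W = \Omega^{-1}X$ the middle factor of the sandwich cancels one of the outer factors, which is why cov_gls agrees with $(X^\top\Omega^{-1}X)^{-1}$.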


The efficiency of the GLS estimator can be verified by showing that the
difference between (7.10), the covariance matrix for the MM estimator $\hat\beta_W$ defined
in (7.09), and (7.05), the covariance matrix for the GLS estimator, is a positive
semidefinite matrix. As was shown in Exercise 3.8, this difference will be
positive semidefinite if and only if the difference between the inverse of (7.05)
and the inverse of (7.10), that is, the matrix


$$X^\top\Omega^{-1}X - X^\top W(W^\top\Omega W)^{-1}W^\top X, \tag{7.11}$$


is positive semidefinite. In Exercise 7.2, readers are invited to show that this
is indeed the case.


The GLS estimator $\hat\beta_{\rm GLS}$ is typically more efficient than the more general MM
estimator $\hat\beta_W$ for all elements of $\beta$, because it is only in very special cases
that the matrix (7.11) will have any zero diagonal elements. Because the OLS
estimator $\hat\beta$ is just $\hat\beta_W$ when $W = X$, we conclude that the GLS estimator
$\hat\beta_{\rm GLS}$ will in most cases be more efficient, and will never be less efficient, than
the OLS estimator $\hat\beta$.
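
A numerical check is no substitute for the proof requested in Exercise 7.2, but it can make the claim concrete. The sketch below, which assumes NumPy and uses arbitrary simulated matrices, evaluates the matrix (7.11) for several choices of $W$ and records its smallest eigenvalue. The function name efficiency_gap is hypothetical.

```python
import numpy as np

def efficiency_gap(X, W, Omega):
    """The matrix (7.11): X'inv(Omega)X - X'W inv(W'Omega W) W'X."""
    first = X.T @ np.linalg.solve(Omega, X)
    XW = X.T @ W
    second = XW @ np.linalg.solve(W.T @ Omega @ W, XW.T)
    return first - second

rng = np.random.default_rng(3)
n, k = 40, 3
X = rng.normal(size=(n, k))
A = rng.normal(size=(n, n))
Omega = A @ A.T / n + np.eye(n)          # illustrative positive definite matrix

# Smallest eigenvalue of (7.11) for several arbitrary choices of W.
min_eigs = []
for _ in range(5):
    W = X + rng.normal(size=(n, k))      # hypothetical exogenous variables
    gap = efficiency_gap(X, W, Omega)
    min_eigs.append(np.linalg.eigvalsh((gap + gap.T) / 2).min())

# With W = inv(Omega) X, the GLS choice, the gap vanishes.
gap_gls = efficiency_gap(X, np.linalg.solve(Omega, X), Omega)
```

Every entry of min_eigs should be nonnegative up to rounding error, and gap_gls should be a matrix of zeros.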


7.3 Computing GLS Estimates 


At first glance, the formula (7.04) for the GLS estimator seems quite simple. 
To calculate $\hat\beta_{\rm GLS}$ when $\Omega$ is known, we apparently just have to invert $\Omega$,
form the matrix $X^\top\Omega^{-1}X$ and invert it, then form the vector $X^\top\Omega^{-1}y$, and,
finally, postmultiply the inverse of $X^\top\Omega^{-1}X$ by $X^\top\Omega^{-1}y$. However, GLS
estimation is not nearly as easy as it looks. The procedure just described
may work acceptably when the sample size $n$ is small, but it rapidly becomes
computationally infeasible as $n$ becomes large. The problem is that $\Omega$ is an
$n \times n$ matrix. When $n = 1000$, simply storing $\Omega$ and its inverse will typically
require 16 MB of memory; when $n = 10{,}000$, storing both these matrices
will require 1600 MB. Even if enough memory were available, computing GLS
estimates in this naive way would be enormously expensive. 
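
The memory figures quoted above follow from simple arithmetic. The sketch below reproduces them under two assumptions that the text does not state explicitly: each matrix element is stored as an 8-byte floating-point number, and one megabyte is taken to be one million bytes. The function name dense_storage_mb is hypothetical.

```python
def dense_storage_mb(n, n_matrices=2, bytes_per_entry=8):
    """Megabytes needed to hold n_matrices dense n x n arrays."""
    return n_matrices * n * n * bytes_per_entry / 1e6

mb_1000 = dense_storage_mb(1000)      # Omega and its inverse, n = 1000
mb_10000 = dense_storage_mb(10000)    # Omega and its inverse, n = 10,000
```

Because storage grows with the square of $n$, a tenfold increase in the sample size multiplies the requirement by one hundred.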


Practical procedures for GLS estimation require us to know quite a lot about 
the structure of the covariance matrix $\Omega$ and its inverse. GLS estimation will
be easy to do if the matrix $\Psi$, defined in (7.02), is known and has a form that
allows us to calculate $\Psi^\top x$, for any vector $x$, without having to store $\Psi$ itself
in memory. If so, we can easily formulate the transformed model (7.03) and
estimate it by OLS. 


There is one important difference between (7.03) and the usual linear regres- 
sion model. For the latter, the variance of the error terms is unknown, while 
for the former, it is known to be 1. Since we can obtain OLS estimates without 
knowing the variance of the error terms, this suggests that we should not need 
to know everything about $\Omega$ in order to obtain GLS estimates. Suppose that
$\Omega = \sigma^2\Delta$, where the $n \times n$ matrix $\Delta$ is known to the investigator, but the
positive scalar $\sigma^2$ is unknown. Then if we replace $\Omega$ by $\Delta$ in the definition
(7.02) of $\Psi$, we can still run regression (7.03), but the error terms will now
have variance $\sigma^2$ instead of variance 1. When we run this modified regression,
we will obtain the estimate


$$(X^\top\Delta^{-1}X)^{-1}X^\top\Delta^{-1}y = (X^\top\Omega^{-1}X)^{-1}X^\top\Omega^{-1}y = \hat\beta_{\rm GLS},$$


where the equality follows immediately from the fact that $\sigma^2/\sigma^2 = 1$. Thus
the GLS estimates will be the same whether we use $\Omega$ or $\Delta$, that is, whether
or not we know $\sigma^2$. However, if $\sigma^2$ is known, we can use the true covariance
matrix (7.05). Otherwise, we must fall back on the estimated covariance
matrix


$$\widehat{\operatorname{Var}}(\hat\beta_{\rm GLS}) = s^2(X^\top\Delta^{-1}X)^{-1},$$


where $s^2$ is the usual OLS estimate (3.49) of the error variance from the
transformed regression.
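
The invariance of the GLS estimates to the scalar $\sigma^2$ is easy to confirm numerically. The following sketch assumes a diagonal $\Delta$ with made-up elements and takes $s^2$ to be the SSR of the transformed regression divided by $n - k$. The helper gls and all numerical values are illustrative assumptions.

```python
import numpy as np

def gls(y, X, V_inv):
    """GLS formula (7.04) with V_inv in the role of the inverse of Omega."""
    return np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

rng = np.random.default_rng(3)
n, sigma2 = 60, 4.0                     # sigma2 is treated as unknown below
X = np.column_stack([np.ones(n), rng.normal(size=n)])
delta = rng.uniform(0.5, 2.0, size=n)   # known diagonal of Delta (assumed)
Delta_inv = np.diag(1.0 / delta)
Omega_inv = Delta_inv / sigma2          # since Omega = sigma2 * Delta
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2 * delta) * rng.normal(size=n)

beta_delta = gls(y, X, Delta_inv)       # uses only Delta
beta_omega = gls(y, X, Omega_inv)       # uses the full Omega

# s^2 from the regression transformed with Delta: SSR / (n - k).
resid_star = (y - X @ beta_delta) / np.sqrt(delta)
s2 = resid_star @ resid_star / (n - X.shape[1])
cov_hat = s2 * np.linalg.inv(X.T @ Delta_inv @ X)
```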


Weighted Least Squares 


It is particularly easy to obtain GLS estimates when the error terms are 
heteroskedastic but uncorrelated. This implies that the matrix $\Omega$ is diagonal.
Let $\omega_t^2$ denote the $t$th diagonal element of $\Omega$. Then $\Omega^{-1}$ is a diagonal matrix
with $t$th diagonal element $\omega_t^{-2}$, and $\Psi$ can be chosen as the diagonal matrix
with $t$th diagonal element $\omega_t^{-1}$. Thus we see that, for a typical observation,
regression (7.03) can be written as


$$\omega_t^{-1}y_t = \omega_t^{-1}X_t\beta + \omega_t^{-1}u_t. \tag{7.12}$$

This regression is to be estimated by OLS. The regressand and regressors are
simply the dependent and independent variables multiplied by $\omega_t^{-1}$, and the
variance of the error term is clearly 1.


For obvious reasons, this special case of GLS estimation is often called
weighted least squares, or WLS. The weight given to each observation when
we run regression (7.12) is $\omega_t^{-1}$. Observations for which the variance of the
error term is large are given low weights, and observations for which it is
small are given high weights. In practice, if $\Omega = \sigma^2\Delta$, with $\Delta$ known but $\sigma^2$
unknown, regression (7.12) remains valid, provided we reinterpret $\omega_t^2$ as the
$t$th diagonal element of $\Delta$ and recognize that the variance of the error terms
is now $\sigma^2$ instead of 1.


There are various ways of determining the weights used in weighted least 
squares estimation. In the simplest case, either theory or preliminary testing 
may suggest that $E(u_t^2)$ is proportional to $z_t^2$, where $z_t$ is some variable that
we observe. For example, $z_t$ might be a variable like population or national
income. In this case, $z_t$ plays the role of $\omega_t$ in equation (7.12). Another
possibility is that the data we actually observe were obtained by grouping data
on different numbers of individual units. Suppose that the error terms for the
ungrouped data have constant variance, but that observation $t$ is the average
of $N_t$ individual observations, where $N_t$ varies. Special cases of standard
results, discussed in Section 3.4, on the variance of a sample mean imply that
the variance of $u_t$ will then be proportional to $1/N_t$. Thus, in this case, $N_t^{-1/2}$
plays the role of $\omega_t$ in equation (7.12).
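
The result for grouped data can be written out in one line. As a sketch, let $u_{ti}$ denote the error term for individual unit $i$ in group $t$ (a notation introduced here only for illustration), and assume that these errors are uncorrelated with common variance $\sigma^2$:

```latex
\[
u_t = \frac{1}{N_t}\sum_{i=1}^{N_t} u_{ti}
\quad\Longrightarrow\quad
\operatorname{Var}(u_t)
  = \frac{1}{N_t^2}\sum_{i=1}^{N_t}\operatorname{Var}(u_{ti})
  = \frac{\sigma^2}{N_t},
\]
\[
\text{so that}\quad \omega_t \propto N_t^{-1/2}
\quad\text{and}\quad \omega_t^{-1} \propto N_t^{1/2}.
\]
```

Every variable in regression (7.12) is therefore multiplied by $N_t^{1/2}$, so that observations based on larger groups receive more weight.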


Weighted least squares estimation can easily be performed using any program 
for OLS estimation. When one is using such a procedure, it is important to 
remember that all the variables in the regression, including the constant term, 
must be multiplied by the same weights. Thus if, for example, the original 
regression is 

$$y_t = \beta_1 + \beta_2 X_t + u_t,$$


the weighted regression will be


$$y_t/\omega_t = \beta_1(1/\omega_t) + \beta_2(X_t/\omega_t) + u_t/\omega_t.$$


Here the regressand is $y_t/\omega_t$, the regressor that corresponds to the constant
term is $1/\omega_t$, and the regressor that corresponds to $X_t$ is $X_t/\omega_t$.
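
In code, the weighted regression amounts to dividing every column, including the column of ones, by $\omega_t$ before calling an OLS routine. The sketch below assumes NumPy and simulated data in which the error variance is proportional to the regressor. The function name wls and the data-generating values are illustrative assumptions.

```python
import numpy as np

def wls(y, X, omega):
    """OLS on the weighted regression (7.12).

    Every column of X, including a column of ones for the constant,
    is divided by omega_t, as is the regressand.
    """
    X_star = X / omega[:, None]
    y_star = y / omega
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta, X_star, y_star

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(1.0, 5.0, size=n)
omega = np.sqrt(x)                           # assumed: error variance equal to x
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + omega * rng.normal(size=n)

beta_wls, X_star, y_star = wls(y, X, omega)

# The same estimates from the GLS formula (7.04) with Omega = diag(omega_t^2).
Omega_inv = np.diag(omega ** -2.0)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
```

Note that the first column of X_star is $1/\omega_t$ rather than a constant. Adding an unweighted column of ones to the weighted regression would change the model being estimated.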


It is possible to report summary statistics like $R^2$, ESS, and SSR either in
terms of the dependent variable $y_t$ or in terms of the transformed regressand
$y_t/\omega_t$. However, it really only makes sense to report $R^2$ in terms of the
transformed regressand. As we saw in Section 2.5, $R^2$ is valid as a measure
of goodness of fit only when the residuals are orthogonal to the fitted values.
This will be true for the residuals and fitted values from OLS estimation of
the weighted regression (7.12), but it will not be true if those residuals and
fitted values are subsequently multiplied by the $\omega_t$ in order to make them
comparable with the original dependent variable.
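
The loss of orthogonality is easy to see numerically. The sketch below, using simulated data and an assumed form for $\omega_t$, computes the inner product of fitted values and residuals first for the weighted regression and then after both have been multiplied by $\omega_t$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
x = rng.normal(size=n)
omega = np.exp(0.5 * x)                        # assumed standard deviations
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 1.0]) + omega * rng.normal(size=n)

# OLS on the weighted regression (7.12).
X_star, y_star = X / omega[:, None], y / omega
beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

fitted_star = X_star @ beta
resid_star = y_star - fitted_star

# Orthogonal in the transformed regression (zero up to rounding error) ...
inner_transformed = fitted_star @ resid_star
# ... but generally not after rescaling to the units of y_t.
inner_original = (omega * fitted_star) @ (omega * resid_star)
```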
Generalized Nonlinear Least Squares


Although, for simplicity, we have focused on the linear regression model, GLS
is also applicable to nonlinear regression models. If the vector of regression
functions were $x(\beta)$ instead of $X\beta$, we could obtain generalized nonlinear
least squares, or GNLS, estimates by minimizing the criterion function


$$(y - x(\beta))^\top\Omega^{-1}(y - x(\beta)), \tag{7.13}$$


which looks just like the GLS criterion function (7.06) for the linear regression
model, except that $x(\beta)$ replaces $X\beta$. If we differentiate (7.13) with respect
to $\beta$ and divide the result by $-2$, we obtain the moment conditions


$$X^\top(\beta)\Omega^{-1}(y - x(\beta)) = 0, \tag{7.14}$$


where, as in Chapter 6, $X(\beta)$ is the matrix of derivatives of $x(\beta)$ with respect
to $\beta$. These moment conditions generalize conditions (6.27) for nonlinear least
squares in the obvious way, and they are evidently equivalent to the moment
conditions (7.07) for the linear case.


Finding estimates that solve equations (7.14) will require some sort of
nonlinear minimization procedure; see Section 6.4. For this purpose, and several
others, the GNR


$$\Psi^\top(y - x(\beta)) = \Psi^\top X(\beta)b + \text{residuals} \tag{7.15}$$


will often be useful. Equation (7.15) is just the ordinary GNR introduced
in equation (6.52), with the regressand and regressors premultiplied by the
matrix $\Psi^\top$ implicitly defined in equation (7.02). It is the GNR associated with
the nonlinear regression model


$$\Psi^\top y = \Psi^\top x(\beta) + \Psi^\top u, \tag{7.16}$$


which is analogous to (7.03). The error terms of (7.16) have covariance matrix
proportional to the identity matrix.
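
One way in which the GNR (7.15) can be used for estimation is as the building block of a Gauss-Newton iteration. The sketch below is only one possible implementation, under several assumptions that are not in the text: the regression function $x_t(\beta) = \beta_1\exp(\beta_2 z_t)$ is a made-up example, $\Psi$ is diagonal and known, and a crude step-halving rule is added so that the criterion (7.13) never increases. The names x_fun, X_jac, and gnls are hypothetical.

```python
import numpy as np

def x_fun(beta, z):
    """Hypothetical regression function x_t(beta) = beta_1 * exp(beta_2 * z_t)."""
    return beta[0] * np.exp(beta[1] * z)

def X_jac(beta, z):
    """X(beta): derivatives of x(beta) with respect to beta_1 and beta_2."""
    e = np.exp(beta[1] * z)
    return np.column_stack([e, beta[0] * z * e])

def gnls(y, z, Psi, beta_start, max_iter=100, tol=1e-12):
    """Minimize the criterion (7.13) by iterating on the GNR (7.15)."""
    def criterion(b):
        r = Psi.T @ (y - x_fun(b, z))
        return r @ r

    beta = np.asarray(beta_start, dtype=float)
    for _ in range(max_iter):
        r_star = Psi.T @ (y - x_fun(beta, z))      # GNR regressand
        X_star = Psi.T @ X_jac(beta, z)            # GNR regressors
        b, *_ = np.linalg.lstsq(X_star, r_star, rcond=None)
        step = 1.0
        # Halve the step while it would increase the criterion.
        while criterion(beta + step * b) > criterion(beta) and step > 1e-8:
            step /= 2.0
        beta = beta + step * b
        if np.linalg.norm(b) < tol:
            break
    return beta

rng = np.random.default_rng(6)
n = 100
z = rng.uniform(0.0, 1.0, size=n)
omega = 0.1 * (1.0 + z)                            # assumed known error s.d.
Psi = np.diag(1.0 / omega)                         # diagonal Psi, as for WLS
y = x_fun(np.array([2.0, 0.8]), z) + omega * rng.normal(size=n)

beta_hat = gnls(y, z, Psi, beta_start=np.array([1.5, 0.5]))

# Left-hand side of the moment conditions (7.14) at the estimates.
moments = X_jac(beta_hat, z).T @ (Psi @ Psi.T) @ (y - x_fun(beta_hat, z))
```

At convergence the GNR coefficients $b$ are zero, which is just another way of saying that the moment conditions (7.14) are satisfied.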


Let us denote the $t$th column of the matrix $\Psi$ by $\psi_t$. Then the asymptotic
theory of Chapter 6 for the nonlinear regression model and the ordinary GNR
applies also to the transformed regression model (7.16) and its associated
GNR (7.15), provided that the transformed regression functions $\psi_t^\top x(\beta)$ are
predetermined with respect to the transformed error terms $\psi_t^\top u$:


$$E\bigl(\psi_t^\top u \mid \psi_t^\top x(\beta)\bigr) = 0. \tag{7.17}$$


If $\Psi$ is not a diagonal matrix, this condition is different from the condition that
the regression functions $x_t(\beta)$ should be predetermined with respect to the $u_t$.
Later in this chapter, we will see that this fact has serious repercussions in 
models with serial correlation. 


Copyright © 1999, Russell Davidson and James G. MacKinnon 


262 Generalized Least Squares and Related Topics 


7.4 Feasible Generalized Least Squares 


In practice, the covariance matrix $\Omega$ is often not known even up to a scalar
factor. This makes it impossible to compute GLS estimates. However, in many
cases it is reasonable to suppose that $\Omega$, or $\Delta$, depends in a known way on
a vector of unknown parameters $\gamma$. If so, it may be possible to estimate $\gamma$
consistently, so as to obtain $\Omega(\hat\gamma)$, say. Then $\Psi(\hat\gamma)$ can be defined as in (7.02),
and GLS estimates computed conditional on $\Psi(\hat\gamma)$. This type of procedure is
called feasible generalized least squares, or feasible GLS, because it is feasible 
in many cases when ordinary GLS is not. 


As a simple example, suppose we want to obtain feasible GLS estimates of 
the linear regression model 


$$y_t = X_t\beta + u_t, \qquad E(u_t^2) = \exp(Z_t\gamma), \tag{7.18}$$


where $\beta$ and $\gamma$ are, respectively, a $k$-vector and an $l$-vector of unknown
parameters, and $X_t$ and $Z_t$ are conformably dimensioned row vectors of
observations on exogenous or predetermined variables that belong to the information
set on which we are conditioning. Some or all of the elements of $Z_t$ may well
belong to $X_t$. The function $\exp(Z_t\gamma)$ is an example of a skedastic function.
In the same way that a regression function determines the conditional mean
of a random variable, a skedastic function determines its conditional variance.
The skedastic function $\exp(Z_t\gamma)$ has the property that it is positive for any
vector $\gamma$. This is a desirable property for any skedastic function to have, since
negative estimated variances would be highly inconvenient.


In order to obtain consistent estimates of $\gamma$, usually we must first obtain
consistent estimates of the error terms in (7.18). The obvious way to do so is
to start by computing OLS estimates $\hat\beta$. This allows us to calculate a vector
of OLS residuals with typical element $\hat u_t$. We can then run the auxiliary linear
regression

$$\log \hat u_t^2 = Z_t\gamma + v_t, \tag{7.19}$$


over observations $t = 1, \ldots, n$ to find the OLS estimates $\hat\gamma$. These estimates
are then used to compute

$$\hat\omega_t = \bigl(\exp(Z_t\hat\gamma)\bigr)^{1/2}$$

for all $t$. Finally, feasible GLS estimates of $\beta$ are obtained by using ordinary
least squares to estimate regression (7.12), with the estimates $\hat\omega_t$ replacing the
unknown $\omega_t$. This is an example of feasible weighted least squares.
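
The procedure just described can be sketched in a few lines. The code below assumes NumPy and simulated data. The function name feasible_wls, the choice $Z_t = X_t$, and the parameter values are illustrative assumptions, not part of the text.

```python
import numpy as np

def feasible_wls(y, X, Z):
    """Feasible weighted least squares for the model (7.18)."""
    # Step 1: OLS estimates and residuals.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    u_hat = y - X @ beta_ols
    # Step 2: auxiliary regression (7.19) of log squared residuals on Z.
    gamma_hat, *_ = np.linalg.lstsq(Z, np.log(u_hat ** 2), rcond=None)
    # Step 3: estimated standard deviations omega_hat_t.
    omega_hat = np.sqrt(np.exp(Z @ gamma_hat))
    # Step 4: OLS on the weighted regression (7.12).
    beta_f, *_ = np.linalg.lstsq(X / omega_hat[:, None], y / omega_hat,
                                 rcond=None)
    return beta_f, gamma_hat, omega_hat

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), x])       # here Z_t happens to equal X_t
gamma_true = np.array([-1.0, 0.8])
u = np.sqrt(np.exp(Z @ gamma_true)) * rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + u

beta_f, gamma_hat, omega_hat = feasible_wls(y, X, Z)
```

Because the exponential function is positive, every element of omega_hat is positive whatever the value of gamma_hat. Any error in the estimated constant of the auxiliary regression rescales all the $\hat\omega_t$ by a common factor, which leaves the weighted least squares estimates unchanged.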


Why Feasible GLS Works 


Under suitable regularity conditions, it can be shown that this type of
procedure yields a feasible GLS estimator $\hat\beta_{\rm F}$ that is consistent and asymptotically
equivalent to the GLS estimator $\hat\beta_{\rm GLS}$. We will not attempt to provide a
rigorous proof of this proposition; for that, see Amemiya (1973a). However,
we will try to provide an intuitive explanation of why it is true.


If we substitute $X\beta_0 + u$ for $y$ into expression (7.04), the formula for the GLS
estimator, we find that


$$\hat\beta_{\rm GLS} = \beta_0 + (X^\top\Omega^{-1}X)^{-1}X^\top\Omega^{-1}u.$$


Taking $\beta_0$ over to the left-hand side, multiplying each factor by an appropriate
power of $n$, and taking probability limits, we see that


$$n^{1/2}(\hat\beta_{\rm GLS} - \beta_0) \stackrel{a}{=} \Bigl(\operatorname*{plim}_{n\to\infty} \tfrac{1}{n}X^\top\Omega^{-1}X\Bigr)^{-1}\Bigl(\operatorname*{plim}_{n\to\infty} n^{-1/2}X^\top\Omega^{-1}u\Bigr). \tag{7.20}$$

Under standard assumptions, the first matrix on the right-hand side is a
nonstochastic $k \times k$ matrix with full rank, while the vector that postmultiplies
it is a stochastic vector which follows the multivariate normal distribution.


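For the feasible GLS estimator, the analog of (7.20) is

$$n^{1/2}(\hat\beta_{\rm F} - \beta_0) \stackrel{a}{=} \Bigl(\operatorname*{plim}_{n\to\infty} \tfrac{1}{n}X^\top\Omega^{-1}(\hat\gamma)X\Bigr)^{-1}\Bigl(\operatorname*{plim}_{n\to\infty} n^{-1/2}X^\top\Omega^{-1}(\hat\gamma)u\Bigr). \tag{7.21}$$


The right-hand sides of expressions (7.21) and (7.20) look very similar, and it
is clear that the latter will be asymptotically equivalent to the former if


$$\operatorname*{plim}_{n\to\infty} \tfrac{1}{n}X^\top\Omega^{-1}(\hat\gamma)X = \operatorname*{plim}_{n\to\infty} \tfrac{1}{n}X^\top\Omega^{-1}X \tag{7.22}$$

and

$$\operatorname*{plim}_{n\to\infty} n^{-1/2}X^\top\Omega^{-1}(\hat\gamma)u = \operatorname*{plim}_{n\to\infty} n^{-1/2}X^\top\Omega^{-1}u. \tag{7.23}$$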


A rigorous statement and proof of the conditions under which equations (7.22) 
and (7.23) hold is beyond the scope of this book. If they are to hold, it is
desirable that $\hat\gamma$ should be a consistent estimator of $\gamma$, and this requires that
the OLS estimator $\hat\beta$ should be consistent. For example, it can be shown
that the estimator obtained by running regression (7.19) would be consistent
if the regressand depended on $u_t$ rather than $\hat u_t$. Since the regressand is
actually $\log\hat u_t^2$, it is necessary that the residuals $\hat u_t$ should consistently estimate
the error terms $u_t$. This in turn requires that $\hat\beta$ should be consistent for $\beta_0$.
Thus, in general, we cannot expect $\hat\gamma$ to be consistent if we do not start with
a consistent estimator of $\beta$.


Unfortunately, as we will see later, if $\Omega(\gamma)$ is not diagonal, then the OLS
estimator $\hat\beta$ is, in general, not consistent whenever any element of $X_t$ is a
lagged dependent variable. A lagged dependent variable is predetermined with
respect to error terms that are innovations, but not with respect to error terms
that are serially correlated. With GLS or feasible GLS estimation, the problem
does not arise, because, if the model is correctly specified, the transformed
explanatory variables are predetermined with respect to the transformed error
terms, as in (7.17). When the OLS estimator is inconsistent, we will have to
obtain a consistent estimator of $\gamma$ in some other way.


Whether or not feasible GLS is a desirable estimation method in practice 
depends on how good an estimate of $\Omega$ can be obtained. If $\Omega(\hat\gamma)$ is a very
good estimate, then feasible GLS will have essentially the same properties as
GLS itself, and inferences based on the GLS covariance matrix (7.05), with
$\Omega(\hat\gamma)$ replacing $\Omega$, should be reasonably reliable, even though they will not
be exact in finite samples. Note that condition (7.22), in addition to being
necessary for the validity of feasible GLS, guarantees that the feasible GLS
covariance matrix estimator converges as $n \to \infty$ to the true GLS covariance
matrix. On the other hand, if $\Omega(\hat\gamma)$ is a poor estimate, feasible GLS estimates
may have quite different properties from real GLS estimates, and inferences 
may be quite misleading. 


It is entirely possible to iterate a feasible GLS procedure. The estimator $\hat\beta_{\rm F}$
can be used to compute a new set of residuals, which can then be used to obtain
a second-round estimate of $\gamma$, which can be used to calculate second-round
feasible GLS estimates, and so on. This procedure can either be stopped after 
a predetermined number of rounds or continued until convergence is achieved 
(if it ever is achieved). Iteration does not change the asymptotic distribution 
of the feasible GLS estimator, but it does change its finite-sample distribution. 
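
An iterated version of feasible weighted least squares for the skedastic function $\exp(Z_t\gamma)$ might look like the sketch below. The function name iterated_fwls, the stopping rule based on the largest change in the coefficient estimates, and the default limits are all assumptions made for illustration. The loop stops either at convergence or after a fixed number of rounds, since convergence is not guaranteed.

```python
import numpy as np

def iterated_fwls(y, X, Z, max_rounds=50, tol=1e-8):
    """Iterate feasible WLS for the skedastic function exp(Z_t gamma)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # round 0: OLS
    for rounds in range(1, max_rounds + 1):
        u_hat = y - X @ beta                            # current residuals
        gamma, *_ = np.linalg.lstsq(Z, np.log(u_hat ** 2), rcond=None)
        omega = np.sqrt(np.exp(Z @ gamma))
        beta_new, *_ = np.linalg.lstsq(X / omega[:, None], y / omega,
                                       rcond=None)
        converged = bool(np.max(np.abs(beta_new - beta)) < tol)
        beta = beta_new
        if converged:
            break
    return beta, gamma, rounds, converged

rng = np.random.default_rng(8)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = X.copy()                                            # illustrative choice
y = X @ np.array([1.0, 2.0]) + np.exp(0.4 * x) * rng.normal(size=n)

beta_it, gamma_it, rounds_used, converged = iterated_fwls(y, X, Z)
beta_one, _, rounds_one, _ = iterated_fwls(y, X, Z, max_rounds=1)
```

Calling the function with max_rounds=1 gives the ordinary, one-step feasible WLS estimates, which makes it easy to see how much the iteration changes them in a given sample.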


Another way to estimate models in which the covariance matrix of the error 
terms depends on one or more unknown parameters is to use the method of 
maximum likelihood. This estimation method, in which $\beta$ and $\gamma$ are estimated
jointly, will be discussed in Chapter 10. In many cases, an iterated feasible 
GLS estimator will be the same as a maximum likelihood estimator based on 
the assumption of normally distributed errors. 


7.5 Heteroskedasticity 


There are two situations in which the error terms are heteroskedastic but seri- 
ally uncorrelated. In the first, the form of the heteroskedasticity is completely 
unknown, while, in the second, the skedastic function is known except for the 
values of some parameters that can be estimated consistently. Concerning the 
case of heteroskedasticity of unknown form, we saw in Sections 5.5 and 6.5 
how to compute asymptotically valid covariance matrix estimates for OLS 
and NLS parameter estimates. The fact that these HCCMEs are sandwich 
covariance matrices makes it clear that, although they are consistent under 
standard regularity conditions, neither OLS nor NLS is efficient when the 
error terms are heteroskedastic. 


If the variances of all the error terms are known, at least up to a scalar
factor, then efficient estimates can be obtained by weighted least squares,
which we discussed in Section 7.3. For a linear model, we need to multiply
all of the variables by $\omega_t^{-1}$, the inverse of the standard error of $u_t$, and then
use ordinary least squares. The usual OLS covariance matrix will be perfectly
valid, although it is desirable to replace $s^2$ by 1 if the variances are completely
known, since in that case $s^2 \to 1$ as $n \to \infty$. For a nonlinear model, we need
to multiply the dependent variable and the entire regression function by $\omega_t^{-1}$
and then use NLS. Once again, the usual NLS covariance matrix will be
asymptotically valid.


If the form of the heteroskedasticity is known, but the skedastic function 
depends on unknown parameters, then we can use feasible weighted least 
squares and still achieve asymptotic efficiency. An example of such a pro- 
cedure was discussed in the previous section. As we have seen, it makes 
no difference asymptotically whether the $\omega_t$ are known or merely estimated
consistently, although it can certainly make a substantial difference in finite 
samples. Asymptotically, at least, the usual OLS or NLS covariance matrix 
is just as valid with feasible WLS as with WLS. 


Testing for Heteroskedasticity 


In some cases, it may be clear from the specification of the model that the 
error terms must exhibit a particular pattern of heteroskedasticity. In many 
cases, however, we may hope that the error terms are homoskedastic but be 
prepared to admit the possibility that they are not. In such cases, if we 
have no information on the form of the skedastic function, it may be prudent 
to employ an HCCME, especially if the sample size is large. In a number of 
simulation experiments, Andrews (1991) has shown that, when the error terms 
are homoskedastic, use of an HCCME, rather than the usual OLS covariance 
matrix, frequently has little cost. However, as we saw in Exercise 5.12, this 
is not always true. In finite samples, tests and confidence intervals based on 
HCCMEs will always be somewhat less reliable than ones based on the usual 
OLS covariance matrix when the latter is appropriate. 


If we have information on the form of the skedastic function, we might well 
wish to use weighted least squares. Before doing so, it is advisable to perform a 
specification test of the null hypothesis that the error terms are homoskedastic 
against whatever heteroskedastic alternatives may seem reasonable. There are 
many ways to perform this type of specification test. The simplest approach 
that is widely applicable, and the only one that we will discuss, involves 
running an artificial regression in which the regressand is the vector of squared 
residuals from the model under test. 


A reasonably general model of conditional heteroskedasticity is

$$E(u_t^2 \mid \Omega_t) = h(\delta + Z_t\gamma), \tag{7.24}$$


where the skedastic function $h(\cdot)$ is a nonlinear function that can take on
only positive values, $Z_t$ is a $1 \times r$ vector of observations on exogenous or
predetermined variables that belong to the information set $\Omega_t$, $\delta$ is a scalar
parameter, and $\gamma$ is an $r$-vector of parameters. Under the null hypothesis
that $\gamma = 0$, the function $h(\delta + Z_t\gamma)$ collapses to $h(\delta)$, a constant. One
plausible specification of the skedastic function is


$$h(\delta + Z_t\gamma) = \exp(\delta + Z_t\gamma) = \exp(\delta)\exp(Z_t\gamma).$$


Under this specification, the variance of $u_t$ reduces to the constant $\sigma^2 = \exp(\delta)$
when $\gamma = 0$. Since, as we will see, one of the advantages of tests based on
artificial regressions is that they do not depend on the functional form of $h(\cdot)$,
there is no need for us to consider specifications less general than (7.24).


If we define $v_t$ as the difference between $u_t^2$ and its conditional expectation,
we can rewrite equation (7.24) as


$$u_t^2 = h(\delta + Z_t\gamma) + v_t, \tag{7.25}$$


which has the form of a regression model. While we would not expect the error
term $v_t$ to be as well behaved as the error terms in most regression models,
since the distribution of $u_t^2$ will almost always be skewed to the right, it does
have mean zero by definition, and we will assume that it has a finite, and
constant, variance. This assumption would probably be excessively strong if $\gamma$
were nonzero, but it seems perfectly reasonable to assume that the variance
of $v_t$ is constant under the null hypothesis that $\gamma = 0$.


Suppose, to begin with, that we actually observe the $u_t$. Since (7.25) has the
form of a regression model, we can then test the null hypothesis that $\gamma = 0$ by
using a Gauss-Newton regression. Suppose the sample mean of the $u_t^2$ is $\tilde\sigma^2$.
Then the obvious estimate of $\delta$ under the null hypothesis is just $\tilde\delta = h^{-1}(\tilde\sigma^2)$.


The GNR corresponding to (7.25) is

$$u_t^2 - h(\delta + Z_t\gamma) = h'(\delta + Z_t\gamma)b_\delta + h'(\delta + Z_t\gamma)Z_t b_\gamma + \text{residual},$$


where $h'(\cdot)$ denotes the first derivative of $h(\cdot)$, $b_\delta$ is the coefficient that
corresponds to $\delta$, and $b_\gamma$ is the $r$-vector of coefficients that corresponds to $\gamma$.
When it is evaluated at $\delta = \tilde\delta$ and $\gamma = 0$, this GNR simplifies to


$$u_t^2 - \tilde\sigma^2 = h'(\tilde\delta)b_\delta + h'(\tilde\delta)Z_t b_\gamma + \text{residual}. \tag{7.26}$$

Since $h'(\tilde\delta)$ is just a constant, its presence has no effect on the explanatory
power of the regression. Moreover, since regression (7.26) includes a constant
term, both the SSR and the centered $R^2$ will be unchanged if we do not bother
to subtract $\tilde\sigma^2$ from the left-hand side. Thus, for the purpose of testing the
null hypothesis that $\gamma = 0$, regression (7.26) is equivalent to the regression


$$u_t^2 = b_\delta + Z_t b_\gamma + \text{residual}, \tag{7.27}$$

with a suitable redefinition of the artificial parameters $b_\delta$ and $b_\gamma$. Observe
that regression (7.27) does not depend on the functional form of $h(\cdot)$.
Standard results for tests based on the GNR imply that the ordinary $F$ statistic
for $b_\gamma = 0$ in this regression, which is printed by most regression packages,
will be asymptotically distributed as $F(r, \infty)$ under the null hypothesis; see
Section 6.7. Another valid test statistic is $n$ times the centered $R^2$ from this
regression, which will be asymptotically distributed as $\chi^2(r)$.


In practice, of course, we do not actually observe the $u_t$. However, as we
noted in Sections 3.6 and 6.3, least squares residuals converge asymptotically
to the corresponding error terms when the model is correctly specified. Thus
it seems plausible that the test will still be asymptotically valid if we replace
$u_t^2$ in regression (7.27) by $\hat u_t^2$, the $t$th squared residual from least squares
estimation of the model under test. The test regression then becomes

$$\hat u_t^2 = b_\delta + Z_t b_\gamma + \text{residual}. \tag{7.28}$$


It can be shown that replacing $u_t^2$ by $\hat u_t^2$ does not change the asymptotic
distribution of the $F$ and $nR^2$ statistics for testing the hypothesis $b_\gamma = 0$; see
Davidson and MacKinnon (1993, Section 11.5). Of course, since the finite-sample
distributions of these test statistics may differ substantially from their
asymptotic ones, it is a very good idea to bootstrap them when the sample
size is small or moderate. This will be discussed further in Section 7.7.
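
Both test statistics can be computed from a single auxiliary regression. The sketch below assumes NumPy and simulated data. The function name het_test and the choice of test variables are illustrative assumptions, and no $P$ values are computed, since those would come from the asymptotic $F(r,\infty)$ and $\chi^2(r)$ distributions or from a bootstrap.

```python
import numpy as np

def het_test(y, X, Z):
    """F and nR^2 statistics from the test regression (7.28).

    Z holds the r test variables and must not include a constant;
    the constant that corresponds to b_delta is added here.
    """
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2                       # squared residuals
    R = np.column_stack([np.ones(n), Z])           # regressors of (7.28)
    coef, *_ = np.linalg.lstsq(R, u2, rcond=None)
    ssr = np.sum((u2 - R @ coef) ** 2)             # unrestricted SSR
    tss = np.sum((u2 - u2.mean()) ** 2)            # SSR when b_gamma = 0
    r = R.shape[1] - 1
    F = ((tss - ssr) / r) / (ssr / (n - r - 1))
    nR2 = n * (1.0 - ssr / tss)                    # n times the centered R^2
    return F, nR2

rng = np.random.default_rng(9)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + np.exp(0.5 * x) * rng.normal(size=n)

Z = np.column_stack([x, x ** 2])                   # r = 2 test variables
F_stat, nR2_stat = het_test(y, X, Z)
```

The two statistics are monotonic transformations of each other, since $F = (R^2/r)\big/\bigl((1-R^2)/(n-r-1)\bigr)$, so they differ only in the reference distribution used.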


Tests based on regression (7.28) require us to choose $Z_t$, and there are many
ways to do so. One approach is to include functions of some of the original
regressors. As we saw in Section 5.5, there are circumstances in which the
usual OLS covariance matrix is valid even when there is heteroskedasticity.
White (1980) showed that, in a linear regression model, if $E(u_t^2)$ is constant
conditional on the squares and cross-products of all the regressors, then there
is no need to use an HCCME. He therefore suggested that $Z_t$ should consist of
the squares and cross-products of all the regressors, because, asymptotically,
such a test will reject the null whenever heteroskedasticity causes the usual
OLS covariance matrix to be invalid. However, unless the number of regressors
is very small, this suggestion will result in $r$, the dimension of $Z_t$, being very
large. As a consequence, the test is likely to have poor finite-sample properties 
and low power, unless the sample size is quite large. 


If economic theory does not tell us how to choose $\mathbf{Z}_t$, there is no simple,
mechanical rule for choosing it. The more variables that are included in $\mathbf{Z}_t$,
the greater is likely to be their ability to explain any observed pattern of heteroskedasticity, but the more degrees of freedom the test statistic will have.
Adding a variable that helps substantially to explain the $u_t^2$ will surely increase
the power of the test. However, adding variables with little explanatory power 
may simply dilute test power by increasing the number of degrees of freedom 
without increasing the noncentrality parameter; recall the discussion in Section 4.7. This is most easily seen in the context of $\chi^2$ tests, where the critical
values increase monotonically with the number of degrees of freedom. For a
test with, say, $r + 1$ degrees of freedom to have as much power as a test with $r$
degrees of freedom, the noncentrality parameter for the former test must be 
a certain amount larger than the noncentrality parameter for the latter. 


7.6 Autoregressive and Moving Average Processes 


The error terms for nearby observations may be correlated, or may appear to 
be correlated, in any sort of regression model, but this phenomenon is most 
commonly encountered in models estimated with time-series data, where it is 
known as serial correlation or autocorrelation. In practice, what appears to 
be serial correlation may instead be evidence of a misspecified model, as we 
discuss in Section 7.9. In some circumstances, though, it is natural to model 
the serial correlation by assuming that the error terms follow some sort of 
stochastic process. Such a process defines a sequence of random variables. 
Some of the stochastic processes that are commonly used to model serial 
correlation will be discussed in this section. 


If there is reason to believe that serial correlation may be present, the first step 
is usually to test the null hypothesis that the errors are serially uncorrelated 
against a plausible alternative that involves serial correlation. Several ways of 
doing this will be discussed in the next section. The second step, if evidence 
of serial correlation is found, is to estimate a model that accounts for it. 
Estimation methods based on NLS and GLS will be discussed in Section 7.8. 
The final step, which is extremely important but is often omitted, is to verify 
that the model which accounts for serial correlation is compatible with the 
data. Some techniques for doing so will be discussed in Section 7.9. 


The AR(1) Process 


One of the simplest and most commonly used stochastic processes is the first- 
order autoregressive process, or AR(1) process. We have already encountered 
regression models with error terms that follow such a process in Sections 6.1 
and 6.6. Recall from (6.04) that the AR(1) process can be written as 


$$u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2), \quad |\rho| < 1. \qquad (7.29)$$


The error at time $t$ is equal to some fraction $\rho$ of the error at time $t - 1$, with
the sign changed if $\rho < 0$, plus the innovation $\varepsilon_t$. Since it is assumed that $\varepsilon_t$
is independent of $\varepsilon_s$ for all $s \neq t$, $\varepsilon_t$ evidently is an innovation, according to
the definition of that term in Section 4.5.


The condition in equation (7.29) that $|\rho| < 1$ is called a stationarity condition,
because it is necessary for the AR(1) process to be stationary. There are 
several definitions of stationarity in time series analysis. According to the 
one that interests us here, a series with typical element $u_t$ is stationary if the
unconditional expectation $E(u_t)$ and the unconditional variance $\text{Var}(u_t)$ exist
and are independent of $t$, and if the covariance $\text{Cov}(u_t, u_{t-j})$ is also, for any
given $j$, independent of $t$. This particular definition is sometimes referred to
as covariance stationarity, or wide sense stationarity. 


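Suppose that, although we begin to observe the series only once $t = 1$, the
series has been in existence for an infinite time. We can then compute the
variance of $u_t$ by substituting successively for $u_{t-1}$, $u_{t-2}$, $u_{t-3}$, and so on in
(7.29). We see that

$$u_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \rho^3\varepsilon_{t-3} + \cdots. \qquad (7.30)$$

Using the fact that the innovations $\varepsilon_t, \varepsilon_{t-1}, \ldots$ are independent, and therefore
uncorrelated, the variance of $u_t$ is seen to be

$$\sigma_u^2 \equiv \text{Var}(u_t) = \sigma_\varepsilon^2 + \rho^2\sigma_\varepsilon^2 + \rho^4\sigma_\varepsilon^2 + \rho^6\sigma_\varepsilon^2 + \cdots = \frac{\sigma_\varepsilon^2}{1 - \rho^2}. \qquad (7.31)$$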


The last expression here is indeed independent of $t$, as required for a stationary
process, but the last equality can be true only if the stationarity condition
$|\rho| < 1$ holds, since that condition is necessary for the infinite series
$1 + \rho^2 + \rho^4 + \rho^6 + \cdots$ to converge. In addition, if $|\rho| > 1$, the last expression in (7.31)
is negative, and so cannot be a variance. In most econometric applications,
where $u_t$ is the error term appended to a regression model, the stationarity
condition is a very reasonable condition to impose, since, without it, the 
variance of the error terms would increase without limit as the sample size 
was increased. 


It is not necessary to make the rather strange assumption that $u_t$ exists for
negative values of $t$ all the way to $-\infty$. If we suppose that the expectation
and variance of $u_1$ are respectively 0 and $\sigma_\varepsilon^2/(1 - \rho^2)$, then we see at once
that $E(u_2) = E(\rho u_1) + E(\varepsilon_2) = 0$, and that

$$\text{Var}(u_2) = \text{Var}(\rho u_1 + \varepsilon_2) = \sigma_\varepsilon^2\Big(\frac{\rho^2}{1 - \rho^2} + 1\Big) = \frac{\sigma_\varepsilon^2}{1 - \rho^2} = \text{Var}(u_1),$$

where the second equality uses the fact that $\varepsilon_2$, because it is an innovation, is
uncorrelated with $u_1$. A simple recursive argument then shows that
$\text{Var}(u_t) = \sigma_\varepsilon^2/(1 - \rho^2)$ for all $t$.


The argument in (7.31) shows that $\sigma_u^2 \equiv \sigma_\varepsilon^2/(1 - \rho^2)$ is the only admissible
value for $\text{Var}(u_t)$ if the series is stationary. Consequently, if the variance
of $u_1$ is not equal to $\sigma_u^2$, then the series cannot be stationary. However, if
the stationarity condition is satisfied, $\text{Var}(u_t)$ must tend to $\sigma_u^2$ as $t$ becomes
large. This can be seen by repeating the calculation in (7.31), but recognizing
that the series has only a finite number of terms. As $t$ grows, the number of
terms becomes large, and the value of the finite sum tends to the value of the
infinite series, which is the stationary variance $\sigma_u^2$.
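
The convergence just described can be checked numerically. The sketch below iterates the recursion $\text{Var}(u_t) = \rho^2\,\text{Var}(u_{t-1}) + \sigma_\varepsilon^2$ implied by (7.29) from an arbitrary starting variance. The function name `ar1_variance_path` and the parameter values are illustrative assumptions.

```python
def ar1_variance_path(rho, sigma_eps2, var_u1, periods):
    """Iterate Var(u_t) = rho^2 * Var(u_{t-1}) + sigma_eps^2, starting
    from a given Var(u_1), and return the list of variances."""
    if abs(rho) >= 1.0:
        raise ValueError("the stationarity condition requires |rho| < 1")
    path = [var_u1]
    for _ in range(periods - 1):
        path.append(rho**2 * path[-1] + sigma_eps2)
    return path

rho, sigma_eps2 = 0.8, 1.0
stationary_var = sigma_eps2 / (1.0 - rho**2)     # the last expression in (7.31)

# Starting from Var(u_1) = 0, the variance approaches the stationary value.
from_zero = ar1_variance_path(rho, sigma_eps2, 0.0, 100)

# Starting at the stationary value, the variance stays there.
from_stationary = ar1_variance_path(rho, sigma_eps2, stationary_var, 100)
```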

It is not difficult to see that, for the AR(1) process (7.29), the covariance of
$u_t$ and $u_{t-1}$ is independent of $t$ if $\text{Var}(u_t) = \sigma_u^2$ for all $t$. In fact,

$$\text{Cov}(u_t, u_{t-1}) = E(u_t u_{t-1}) = E\big((\rho u_{t-1} + \varepsilon_t)u_{t-1}\big) = \rho\sigma_u^2.$$

In order to compute the correlation of $u_t$ and $u_{t-1}$, we divide $\text{Cov}(u_t, u_{t-1})$
by the square root of the product of the variances of $u_t$ and $u_{t-1}$, that is,
by $\sigma_u^2$. We then find that the correlation of $u_t$ and $u_{t-1}$ is just $\rho$.


More generally, as readers are asked to demonstrate in Exercise 7.4, under
the assumption that $\text{Var}(u_1) = \sigma_u^2$, the covariance of $u_t$ and $u_{t-j}$, and also
the covariance of $u_t$ and $u_{t+j}$, is equal to $\rho^j\sigma_u^2$, independently of $t$. It follows
that the AR(1) process (7.29) is indeed covariance stationary if $\text{Var}(u_1) = \sigma_u^2$.
The correlation between $u_t$ and $u_{t-j}$ is of course just $\rho^j$. Since $\rho^j$ tends
to zero quite rapidly as $j$ increases, except when $|\rho|$ is very close to 1, this
result implies that an AR(1) process will generally exhibit small correlations 
between observations that are far removed in time, but it may exhibit large 
correlations between observations that are close in time. Since this is precisely 
the pattern that is frequently observed in the residuals of regression models 
estimated using time-series data, it is not surprising that the AR(1) process 
is often used to account for serial correlation in such models. 


If we combine the result (7.31) with the result proved in Exercise 7.4, we see 
that, if the AR(1) process (7.29) is stationary, the covariance matrix of the 
vector u can be written as 


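$$\boldsymbol\Omega(\rho) = \frac{\sigma_\varepsilon^2}{1 - \rho^2}
\begin{bmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\
\rho & 1 & \rho & \cdots & \rho^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1
\end{bmatrix}. \qquad (7.32)$$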


All the $u_t$ have the same variance, $\sigma_u^2$, which by (7.31) is the first factor on
the right-hand side of (7.32). It follows that the other factor, the matrix in
square brackets, which we denote $\boldsymbol\Delta(\rho)$, is the matrix of correlations of the
error terms. We will need to make use of (7.32) in Section 7.7 when we discuss 
GLS estimation of regression models with AR(1) errors. 
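
For concreteness, the matrix in (7.32) can be built in a few lines. This is a minimal sketch using NumPy; the function name `ar1_covariance` and the example values are assumptions made for illustration.

```python
import numpy as np

def ar1_covariance(rho, sigma_eps2, n):
    """Covariance matrix (7.32) of n observations from a stationary AR(1)
    process: sigma_eps^2 / (1 - rho^2) times the matrix of correlations."""
    if abs(rho) >= 1.0:
        raise ValueError("the stationarity condition requires |rho| < 1")
    t = np.arange(n)
    delta = rho ** np.abs(t[:, None] - t[None, :])   # element (t, s) is rho^|t-s|
    return sigma_eps2 / (1.0 - rho**2) * delta

omega = ar1_covariance(0.5, 1.0, 4)
```

Every diagonal element of `omega` equals the stationary variance, and each element that is $j$ places off the diagonal equals $\rho^j$ times that variance.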


Higher-Order Autoregressive Processes 


Although the AR(1) process is very useful, it is quite restrictive. A much 
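more general stochastic process is the $p^{\text{th}}$ order autoregressive process, or
AR(p) process,

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2). \qquad (7.33)$$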


For such a process, $u_t$ depends on up to $p$ lagged values of itself, as well as
on $\varepsilon_t$. The AR(p) process (7.33) can also be expressed as

$$(1 - \rho_1 L - \rho_2 L^2 - \cdots - \rho_p L^p)u_t = \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2), \qquad (7.34)$$

where $L$ denotes the lag operator. The lag operator $L$ has the property that
when $L$ multiplies anything with a time subscript, this subscript is lagged
one period. Thus $Lu_t = u_{t-1}$, $L^2u_t = u_{t-2}$, $L^3u_t = u_{t-3}$, and so on. The
expression in parentheses in (7.34) is a polynomial in the lag operator $L$, with
coefficients 1 and $-\rho_1, \ldots, -\rho_p$. If we make the definition

$$\rho(z) \equiv \rho_1 z + \rho_2 z^2 + \cdots + \rho_p z^p \qquad (7.35)$$

for arbitrary $z$, we can write the AR(p) process (7.34) very compactly as

$$(1 - \rho(L))u_t = \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2).$$


This compact notation is useful, but it does have two disadvantages: The 
order of the process, p, is not apparent, and there is no way of expressing any 
restrictions on the $\rho_i$.


The stationarity condition for an AR(p) process may be expressed in several 
ways. One of them, based on the definition (7.35), is that all the roots of the 
polynomial equation 

$$1 - \rho(z) = 0 \qquad (7.36)$$


must lie outside the unit circle. This simply means that all of the (possibly
complex) roots of equation (7.36) must be greater than 1 in absolute value.¹
This condition can lead to quite complicated restrictions on the $\rho_i$ for general
AR(p) processes. The stationarity condition that $|\rho_1| < 1$ for an AR(1) process is evidently a consequence of this condition. In that case, (7.36) reduces
to the equation $1 - \rho_1 z = 0$, the unique root of which is $z = 1/\rho_1$, and this root
will be greater than 1 in absolute value if and only if $|\rho_1| < 1$. As with the
AR(1) process, the stationarity condition for an AR(p) process is necessary
but not sufficient. Stationarity requires in addition that the variances and
covariances of $u_1, \ldots, u_p$ should be equal to their stationary values. If not, it
remains true that $\text{Var}(u_t)$ and $\text{Cov}(u_t, u_{t-j})$ tend to their stationary values
for large $t$ if the stationarity condition is satisfied.
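
The root condition on (7.36) is easy to check numerically for given coefficients. The following sketch relies on `numpy.roots`; the function name `ar_is_stationary` and the example coefficient values are assumptions for illustration.

```python
import numpy as np

def ar_is_stationary(rhos):
    """Check whether all roots of 1 - rho_1 z - ... - rho_p z^p = 0
    lie strictly outside the unit circle."""
    rhos = np.asarray(rhos, dtype=float)
    # np.roots expects coefficients ordered from the highest power down,
    # so the polynomial is written as -rho_p z^p - ... - rho_1 z + 1.
    coeffs = np.concatenate([-rhos[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

ar1_ok = ar_is_stationary([0.5])                        # root is z = 2
ar1_bad = ar_is_stationary([1.2])                       # root is z = 1/1.2
ar2_ok = ar_is_stationary([0.5, 0.3])
ar2_bad = ar_is_stationary([0.6, 0.5])
ar4_seasonal = ar_is_stationary([0.0, 0.0, 0.0, 0.5])   # only the fourth lag matters
```

Computing the roots avoids having to write out the restrictions on the $\rho_i$ explicitly, which become complicated as $p$ grows.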


In practice, when an AR(p) process is used to model the error terms of a re- 
gression model, p is usually chosen to be quite small. By far the most popular 
choice is the AR(1) process, but AR(2) and AR(4) processes are also encoun- 
tered reasonably frequently. AR(4) processes are particularly attractive for 
quarterly data, because seasonality may cause correlation between error terms 
that are four periods apart. 


Moving Average Processes 


Autoregressive processes are not the only way to model stationary time series. 
Another type of stochastic process is the moving average, or MA, process. The 
simplest of these is the first-order moving average, or MA(1), process 


$$u_t = \varepsilon_t + \alpha_1\varepsilon_{t-1}, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2), \qquad (7.37)$$

in which the error term $u_t$ is a weighted average of two successive innovations,
$\varepsilon_t$ and $\varepsilon_{t-1}$.

¹ For a complex number $a + bi$, $a$ and $b$ real, the absolute value is $(a^2 + b^2)^{1/2}$.


It is not difficult to calculate the covariance matrix for an MA(1) process. 
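From (7.37), we see that the variance of $u_t$ is

$$\sigma_u^2 \equiv E\big((\varepsilon_t + \alpha_1\varepsilon_{t-1})^2\big) = \sigma_\varepsilon^2 + \alpha_1^2\sigma_\varepsilon^2 = (1 + \alpha_1^2)\sigma_\varepsilon^2,$$

the covariance of $u_t$ and $u_{t-1}$ is

$$E\big((\varepsilon_t + \alpha_1\varepsilon_{t-1})(\varepsilon_{t-1} + \alpha_1\varepsilon_{t-2})\big) = \alpha_1\sigma_\varepsilon^2,$$

and the covariance of $u_t$ and $u_{t-j}$ for $j > 1$ is 0. Therefore, the covariance
matrix of the entire vector $\mathbf{u}$ is

$$\sigma_\varepsilon^2\boldsymbol\Delta(\alpha_1) \equiv \sigma_\varepsilon^2
\begin{bmatrix}
1 + \alpha_1^2 & \alpha_1 & 0 & \cdots & 0 & 0 \\
\alpha_1 & 1 + \alpha_1^2 & \alpha_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \alpha_1 & 1 + \alpha_1^2
\end{bmatrix}. \qquad (7.38)$$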


It is evident from (7.38) that there is no correlation between error terms 
which are more than one period apart. Moreover, the correlation between 
successive error terms varies only between $-0.5$ and $0.5$, the smallest and
largest possible values of $\alpha_1/(1 + \alpha_1^2)$, which are achieved when $\alpha_1 = -1$
and $\alpha_1 = 1$, respectively. Therefore, an MA(1) process cannot be appropriate
when the observed correlation between successive residuals is large in absolute 
value, or when residuals that are not adjacent are correlated. 
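
The bound on the first-order autocorrelation of an MA(1) process can be confirmed with a short calculation. In this sketch, the function name `ma1_autocorrelation` and the grid of values for $\alpha_1$ are illustrative assumptions.

```python
def ma1_autocorrelation(alpha1):
    """Correlation between successive error terms implied by (7.38):
    alpha_1 * sigma_eps^2 divided by (1 + alpha_1^2) * sigma_eps^2."""
    return alpha1 / (1.0 + alpha1**2)

# Evaluate the correlation on a grid of alpha_1 values from -5 to 5.
grid = [i / 100.0 for i in range(-500, 501)]
corrs = [ma1_autocorrelation(a) for a in grid]

largest = max(corrs)     # attained at alpha_1 = 1
smallest = min(corrs)    # attained at alpha_1 = -1
```

Note that $\alpha_1$ and $1/\alpha_1$ imply the same correlation, so values of $|\alpha_1|$ greater than 1 do not widen the attainable range.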


Just as AR(p) processes generalize the AR(1) process, higher-order moving 
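average processes generalize the MA(1) process. The $q^{\text{th}}$ order moving average process, or MA(q) process, may be written as

$$u_t = \varepsilon_t + \alpha_1\varepsilon_{t-1} + \alpha_2\varepsilon_{t-2} + \cdots + \alpha_q\varepsilon_{t-q}, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2). \qquad (7.39)$$

Using lag-operator notation, the process (7.39) can also be written as

$$u_t = (1 + \alpha_1 L + \cdots + \alpha_q L^q)\varepsilon_t \equiv (1 + \alpha(L))\varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2),$$

where $\alpha(L)$ is a polynomial in the lag operator.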


Autoregressive processes, moving average processes, and other related stochas- 
tic processes have many important applications in both econometrics and 
macroeconomics. These processes will be discussed further in Chapter 13. 
Their properties have been studied extensively in the literature on time-series 
methods. A classic reference is Box and Jenkins (1976), which has been up- 
dated as Box, Jenkins, and Reinsel (1994). Books that are specifically aimed 
at economists include Granger and Newbold (1986), Harvey (1989), Hamilton 
(1994), and Hayashi (2000). 


7.7 Testing for Serial Correlation


Over the decades, an enormous amount of research has been devoted to the 
subject of specification tests for serial correlation in regression models. Even 
though a great many different tests have been proposed, many of them no 
longer of much interest, the subject is not really very complicated. As we show 
in this section, it is perfectly easy to test the null hypothesis that the error 
terms of a regression model are serially uncorrelated against the alternative 
that they follow an autoregressive process of any specified order. Most of the 
tests that we will discuss are straightforward applications of testing procedures 
which were introduced in Chapters 4 and 6. 


As we saw in Section 6.1, the linear regression model 
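
$$y_t = \mathbf{X}_t\boldsymbol\beta + u_t, \quad u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2), \qquad (7.40)$$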


in which the error terms follow an AR(1) process, can, if we ignore the first 
observation, be rewritten as the nonlinear regression model 


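$$y_t = \rho y_{t-1} + \mathbf{X}_t\boldsymbol\beta - \rho\mathbf{X}_{t-1}\boldsymbol\beta + \varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0, \sigma_\varepsilon^2). \qquad (7.41)$$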


The null hypothesis that $\rho = 0$ can then be tested using any procedure that is
appropriate for testing hypotheses about the parameters of nonlinear regres- 
sion models; see Section 6.7. 


One approach is just to estimate the model (7.41) by NLS and calculate the 
ordinary t statistic for $\rho = 0$. Because the model is nonlinear, and because
it includes a lagged dependent variable, this t statistic will not follow the 
Student’s t distribution in finite samples, even if the error terms happen to 
be normally distributed. However, under the null hypothesis, it will follow 
the standard normal distribution asymptotically. The F statistic computed 
using the unrestricted SSR from (7.41) and the restricted SSR from an OLS 
regression of y on X for the period t = 2 to n is also asymptotically valid. 
Since the model (7.41) is nonlinear, this F statistic will not be numerically 
equal to the square of the t statistic in this case, although the two will be 
asymptotically equal under the null hypothesis. 


Tests Based on the GNR 


We can avoid having to estimate the nonlinear model (7.41) by using tests 
based on the Gauss-Newton regression. Let $\tilde{\boldsymbol\beta}$ denote the vector of OLS
estimates obtained from the restricted model

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}, \qquad (7.42)$$

and let $\tilde{\mathbf{u}}$ denote the vector of OLS residuals from this regression. Then, as
we saw in Section 6.7, the GNR for testing the null hypothesis that $\rho = 0$ is

$$\tilde{\mathbf{u}} = \mathbf{X}\mathbf{b} + b_\rho\tilde{\mathbf{u}}_1 + \text{residuals}, \qquad (7.43)$$

where $\tilde{\mathbf{u}}_1$ is a vector with typical element $\tilde u_{t-1}$; recall (6.84). The ordinary
t statistic for $b_\rho = 0$ in this regression will be asymptotically distributed as
$N(0, 1)$ under the null hypothesis.


It is worth noting that the t statistic for $b_\rho = 0$ in the GNR (7.43) is identical
to the t statistic for $b_\rho = 0$ in the regression

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + b_\rho\tilde{\mathbf{u}}_1 + \text{residuals}. \qquad (7.44)$$


Regression (7.44) is just the original regression model (7.42) with the lagged 
OLS residuals from that model added as an additional regressor. By use of 
the FWL Theorem, it can readily be seen that (7.44) has the same SSR and 
the same estimate of $b_\rho$ as the GNR (7.43). Therefore, a GNR-based test for
serial correlation is formally the same as a test for omitted variables, where 
the omitted variables are lagged residuals from the model under test. 


Although regressions (7.43) and (7.44) look perfectly simple, it is not quite 
clear how they should be implemented. Both the original regression (7.42) 
and the test regression (7.43) or (7.44) may be estimated either over the entire 
sample period or over the shorter period from t = 2 to n. If one of them is 
run over the full sample period and the other is run over the shorter period, 
then $\tilde{\mathbf{u}}$ will not be orthogonal to $\mathbf{X}$. This does not affect the asymptotic
distribution of the t statistic, but it may affect its finite-sample distribution. 
The easiest approach is probably to estimate both equations over the entire 
sample period. If this is done, the unobserved value of $\tilde u_0$ must be replaced
by 0 before the test regression is run. As Exercise 7.14 demonstrates, running 
the GNR (7.43) in different ways results in test statistics that are numerically 
different, even though they all follow the same asymptotic distribution under 
the null hypothesis. 
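
The following sketch shows one way to run the test regressions (7.43) and (7.44) over the entire sample period, with the unobserved $\tilde u_0$ replaced by 0. It uses NumPy; the function names `ols_tstats` and `gnr_ar1_tstat`, the `regressand` argument, and the simulated data are assumptions made for this illustration.

```python
import numpy as np

def ols_tstats(y, X):
    """OLS coefficients and conventional t statistics."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - k)
    return b, b / np.sqrt(s2 * np.diag(xtx_inv))

def gnr_ar1_tstat(y, X, regressand="residuals"):
    """t statistic on the lagged residual in (7.43) or, when
    regressand="y", in (7.44). Both regressions use the full sample."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b                               # restricted OLS residuals
    u_lag = np.concatenate([[0.0], u[:-1]])     # unobserved u_0 replaced by 0
    W = np.column_stack([X, u_lag])
    lhs = y if regressand == "y" else u
    return ols_tstats(lhs, W)[1][-1]

# Hypothetical data with AR(1) errors, rho = 0.6. For simplicity the error
# process starts at u_1 = eps_1 rather than from its stationary distribution.
rng = np.random.default_rng(1)
n, rho = 200, 0.6
eps = rng.normal(size=n)
u = np.empty(n)
u[0] = eps[0]
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + u

t_gnr = gnr_ar1_tstat(y, X)                     # from regression (7.43)
t_aug = gnr_ar1_tstat(y, X, regressand="y")     # from regression (7.44)
```

The two t statistics agree up to rounding error, as the FWL Theorem argument above implies.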


Tests based on the GNR have several attractive features in addition to ease of 
computation. Unlike some other tests that will be discussed shortly, they are 
asymptotically valid under the relatively weak assumption that $E(u_t \mid \mathbf{X}_t) = 0$,
which allows $\mathbf{X}_t$ to include lagged dependent variables. Moreover, they are
easily generalized to deal with nonlinear regression models. If the original
model is nonlinear, we simply need to replace $\mathbf{X}_t$ in the test regression (7.43)
by $\mathbf{X}_t(\tilde{\boldsymbol\beta})$, where, as usual, the $i^{\text{th}}$ element of $\mathbf{X}_t(\tilde{\boldsymbol\beta})$ is the derivative of the
regression function with respect to the $i^{\text{th}}$ parameter, evaluated at the NLS
estimates $\tilde{\boldsymbol\beta}$ of the model being tested; see Exercise 7.5.


Another very attractive feature of GNR-based tests is that they can readily 
be used to test against higher-order autoregressive processes and even moving 
average processes. For example, in order to test against an AR(p) process, we 
simply need to run the test regression 


$$\tilde u_t = \mathbf{X}_t\mathbf{b} + b_{\rho_1}\tilde u_{t-1} + \cdots + b_{\rho_p}\tilde u_{t-p} + \text{residual} \qquad (7.45)$$


and use an asymptotic F test of the null hypothesis that the coefficients on 
all the lagged residuals are zero; see Exercise 7.6. Of course, in order to run 
regression (7.45), we will either need to drop the first $p$ observations or replace
the unobserved lagged values of $\tilde u_t$ with zeros.
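
A test against AR(p) errors based on (7.45) might be sketched as follows, replacing the unobserved lagged residuals with zeros. The function names `lagged` and `gnr_arp_fstat` and the simulated quarterly-style data are assumptions for illustration; the F statistic is the usual one comparing restricted and unrestricted sums of squared residuals.

```python
import numpy as np

def lagged(u, j):
    """Lag a vector by j >= 1 periods, filling unobserved values with zeros."""
    return np.concatenate([np.zeros(j), u[:-j]])

def gnr_arp_fstat(y, X, p):
    """F statistic for zero coefficients on p lagged residuals in (7.45)."""
    n, k = X.shape
    u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]      # restricted residuals
    W = np.column_stack([X] + [lagged(u, j) for j in range(1, p + 1)])
    e = u - W @ np.linalg.lstsq(W, u, rcond=None)[0]      # residuals from (7.45)
    # Regressing u on X alone leaves u unchanged, so the restricted SSR is u'u.
    rssr, ussr = u @ u, e @ e
    return ((rssr - ussr) / p) / (ussr / (n - k - p))

# Hypothetical data: errors depend on their own value four periods earlier.
rng = np.random.default_rng(3)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(n):
    u[t] = eps[t] + (0.5 * u[t - 4] if t >= 4 else 0.0)
y = X @ np.array([2.0, -1.0]) + u

f4 = gnr_arp_fstat(y, X, p=4)   # compare with F(4, n - k - 4) critical values
```

With $p = 1$, this F statistic is simply the square of the t statistic on the lagged residual in the GNR (7.43).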


If we wish to test against an MA(q) process, it turns out that we can proceed
exactly as if we were testing against an AR(q) process. The reason is that an 
autoregressive process of any order is locally equivalent to a moving average 
process of the same order. Intuitively, this means that, for large samples, an 
AR(q) process and an MA(q) process look the same in the neighborhood of 
the null hypothesis of no serial correlation. Since tests based on the GNR 
use information on first derivatives only, it should not be surprising that the 
GNRs used for testing against both alternatives turn out to be identical; see 
Exercise 7.7. 


The use of the GNR (7.43) for testing against AR(1) errors was first suggested 
by Durbin (1970). Breusch (1978) and Godfrey (1978a, 1978b) subsequently 
showed how to use GNRs to test against AR(p) and MA(q) errors. For a more
detailed treatment of these and related procedures, see Godfrey (1988). 


Older, Less Widely Applicable, Tests 


Readers should be warned at once that the tests we are about to discuss are 
not recommended for general use. However, they still appear often enough in 
current literature and in current econometrics software for it to be necessary 
that practicing econometricians be familiar with them. Besides, studying 
them reveals some interesting aspects of models with serially correlated errors. 


To begin with, consider the simple regression

$$\tilde u_t = b_\rho\tilde u_{t-1} + \text{residual}, \quad t = 1, \ldots, n, \qquad (7.46)$$

where, as above, the $\tilde u_t$ are the residuals from regression (7.42). In order to
be able to keep the first observation, we assume that $\tilde u_0 = 0$. This regression
yields an estimate of $b_\rho$, which we will call $\tilde\rho$ because it is an estimate of $\rho$
based on the residuals under the null. Explicitly, we have

$$\tilde\rho = \frac{n^{-1}\sum_{t=1}^{n}\tilde u_t\tilde u_{t-1}}{n^{-1}\sum_{t=1}^{n}\tilde u_{t-1}^2}, \qquad (7.47)$$

where we have divided numerator and denominator by $n$ for the purposes
of the asymptotic analysis to follow. It turns out that, if the explanatory
variables $\mathbf{X}$ in (7.42) are all exogenous, then $\tilde\rho$ is a consistent estimator of the
parameter $\rho$ in model (7.40), or, equivalently, (7.41), where it is not assumed
that $\rho = 0$. This slightly surprising result depends crucially on the assumption
of exogenous regressors. If one of the variables in $\mathbf{X}$ is a lagged dependent
variable, the result no longer holds.


Asymptotically, it makes no difference if we replace the sum in the denominator by $n^{-1}\sum_{t=1}^{n}\tilde u_t^2$, because we are effectively including just one more term,
namely, $\tilde u_n^2$. Then we can write the denominator of (7.47) as $n^{-1}\mathbf{u}^\top\mathbf{M}_X\mathbf{u}$,
where, as usual, the orthogonal projection matrix $\mathbf{M}_X$ projects on to $\mathcal{S}^\perp(\mathbf{X})$.
If the vector u is generated by a stationary AR(1) process, it can be shown 
that a law of large numbers can be applied to both the numerator and the 
denominator of (7.47). Thus, asymptotically, both numerator and denomina- 
tor can be replaced by their expectations. For a stationary AR(1) process, 
the covariance matrix $\boldsymbol\Omega$ of $\mathbf{u}$ is given by (7.32), and so we can compute the
expectation of the denominator as follows, making use of the invariance under 
cyclic permutations of the trace of a matrix product that was first employed 
in Section 2.6: 


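$$\begin{aligned}
E(n^{-1}\mathbf{u}^\top\mathbf{M}_X\mathbf{u}) &= E\big(n^{-1}\text{Tr}(\mathbf{M}_X\mathbf{u}\mathbf{u}^\top)\big) \\
&= n^{-1}\text{Tr}\big(\mathbf{M}_X E(\mathbf{u}\mathbf{u}^\top)\big) \\
&= n^{-1}\text{Tr}(\mathbf{M}_X\boldsymbol\Omega) \\
&= n^{-1}\text{Tr}(\boldsymbol\Omega) - n^{-1}\text{Tr}(\mathbf{P}_X\boldsymbol\Omega).
\end{aligned} \qquad (7.48)$$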


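Note that, in the passage to the second line, we made use of the exogeneity
of $\mathbf{X}$, and hence of $\mathbf{M}_X$. From (7.32), we see that $n^{-1}\text{Tr}(\boldsymbol\Omega) = \sigma_\varepsilon^2/(1 - \rho^2)$.
For the second term in (7.48), we have that

$$\text{Tr}(\mathbf{P}_X\boldsymbol\Omega) = \text{Tr}\big(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\Omega\big) = \text{Tr}\big((n^{-1}\mathbf{X}^\top\mathbf{X})^{-1}n^{-1}\mathbf{X}^\top\boldsymbol\Omega\mathbf{X}\big),$$

where again we have made use of the invariance of the trace under cyclic permutations. Our usual regularity conditions tell us that both $n^{-1}\mathbf{X}^\top\mathbf{X}$ and
$n^{-1}\mathbf{X}^\top\boldsymbol\Omega\mathbf{X}$ tend to finite limits as $n \to \infty$. Thus, on account of the extra
factor of $n^{-1}$ in front of the second term in (7.48), that term vanishes asymptotically. It follows that the limit of the denominator of (7.47) is $\sigma_\varepsilon^2/(1 - \rho^2)$.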


The expectation of the numerator can be handled similarly. It is convenient to 
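introduce an $n \times n$ matrix $\mathbf{L}$ that can be thought of as the matrix expression
of the lag operator $L$. All the elements of $\mathbf{L}$ are zero except those on the
diagonal just beneath the principal diagonal, which are all equal to 1:

$$\mathbf{L} = \begin{bmatrix}
0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{bmatrix}. \qquad (7.49)$$

It is easy to see that $(\mathbf{L}\mathbf{u})_t = u_{t-1}$ for $t = 2, \ldots, n$, and $(\mathbf{L}\mathbf{u})_1 = 0$. With this
definition, the numerator of (7.47) becomes $n^{-1}\tilde{\mathbf{u}}^\top\mathbf{L}\tilde{\mathbf{u}} = n^{-1}\mathbf{u}^\top\mathbf{M}_X\mathbf{L}\mathbf{M}_X\mathbf{u}$,
of which the expectation, by a similar argument to that used above, is

$$n^{-1}E\big(\text{Tr}(\mathbf{M}_X\mathbf{L}\mathbf{M}_X\mathbf{u}\mathbf{u}^\top)\big) = n^{-1}\text{Tr}(\mathbf{M}_X\mathbf{L}\mathbf{M}_X\boldsymbol\Omega). \qquad (7.50)$$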

When $\mathbf{M}_X$ is expressed as $\mathbf{I} - \mathbf{P}_X$, the leading term in this expression is just
$n^{-1}\text{Tr}(\mathbf{L}\boldsymbol\Omega)$. By arguments similar to those used above, which readers are invited
to make explicit in Exercise 7.8, the other terms, which contain at least one
factor of $\mathbf{P}_X$, all vanish asymptotically.


It can be seen from (7.49) that premultiplying $\boldsymbol\Omega$ by $\mathbf{L}$ pushes all the rows of
$\boldsymbol\Omega$ down by one row, leaving the first row with nothing but zeros, and with
the last row of $\boldsymbol\Omega$ falling off the end and being lost. The trace of $\mathbf{L}\boldsymbol\Omega$ is thus
just the sum of the elements of the first diagonal of $\boldsymbol\Omega$ above the principal
diagonal. From (7.32), $n^{-1}$ times this sum is equal to $n^{-1}(n - 1)\sigma_\varepsilon^2\rho/(1 - \rho^2)$, which
is asymptotically equivalent to $\rho\sigma_\varepsilon^2/(1 - \rho^2)$. Combining this result with the
earlier one for the denominator, we see that the limit of $\tilde\rho$ as $n \to \infty$ is just $\rho$.
This proves our result.
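
The two facts about the matrix in (7.49) that the argument relies on can be verified numerically. The sketch below uses NumPy, and the values of `n`, `rho`, and `sigma_eps2` are arbitrary illustrative choices.

```python
import numpy as np

n, rho, sigma_eps2 = 50, 0.7, 2.0

# Covariance matrix (7.32) of a stationary AR(1) process.
t = np.arange(n)
omega = sigma_eps2 / (1.0 - rho**2) * rho ** np.abs(t[:, None] - t[None, :])

# Matrix form of the lag operator: ones just beneath the principal diagonal.
L = np.eye(n, k=-1)

# Premultiplying a vector by L lags it by one period.
u = np.arange(1.0, n + 1.0)            # any vector will do for this check
lag_u = L @ u                          # (Lu)_1 = 0 and (Lu)_t = u_{t-1}

# n^{-1} Tr(L Omega) sums the first diagonal of Omega above the principal one.
trace_term = np.trace(L @ omega) / n
closed_form = (n - 1) / n * sigma_eps2 * rho / (1.0 - rho**2)
```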


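Besides providing a consistent estimator of $\rho$, regression (7.46) also yields
a t statistic for the hypothesis that $b_\rho = 0$. This t statistic provides what is
probably the simplest imaginable test for first-order serial correlation, and it is
asymptotically valid if the explanatory variables $\mathbf{X}$ are exogenous. The easiest
way to see this is to show that the t statistic from (7.46) is asymptotically
equivalent to the t statistic for $b_\rho = 0$ in the GNR (7.43). If $\tilde{\mathbf{u}}_1 \equiv \mathbf{L}\tilde{\mathbf{u}}$, the t
statistic from the GNR (7.43) may be written as

$$t_{\text{GNR}} = \frac{n^{-1/2}\tilde{\mathbf{u}}^\top\mathbf{M}_X\tilde{\mathbf{u}}_1}{s\,(n^{-1}\tilde{\mathbf{u}}_1^\top\mathbf{M}_X\tilde{\mathbf{u}}_1)^{1/2}}, \qquad (7.51)$$

and the t statistic from the simple regression (7.46) may be written as

$$t_{\text{SR}} = \frac{n^{-1/2}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}}_1}{\acute{s}\,(n^{-1}\tilde{\mathbf{u}}_1^\top\tilde{\mathbf{u}}_1)^{1/2}}, \qquad (7.52)$$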


where s and ś are the square roots of the estimated error variances for (7.43) 
and (7.46), respectively. Of course, the factors of n in the numerators and 
denominators of (7.51) and (7.52) cancel out and may be ignored for any 
purpose except asymptotic analysis. 


Since $\tilde{\mathbf{u}} = \mathbf{M}_X\tilde{\mathbf{u}}$, it is clear that both statistics have the same numerator.
Moreover, s and ś are asymptotically equal under the null hypothesis that
$\rho = 0$, because (7.43) and (7.46) have the same regressand, and all the parameters tend to zero as $n \to \infty$ for both regressions. Therefore, the residuals,
and so also the SSRs for the two regressions, tend to the same limits. Under
the assumption that $\mathbf{X}$ is exogenous, the second factors in the denominators can be shown to be asymptotically equal by the same sort of reasoning
used above: Both have limits of $\sigma_\varepsilon$. Thus we conclude that, when the null
hypothesis is true, the test statistics $t_{\text{GNR}}$ and $t_{\text{SR}}$ are asymptotically equal.


It is probably useful at this point to reissue a warning about the test based 
on the simple regression (7.46). It is valid only if X is exogenous. If X 
contains variables that are merely predetermined rather than exogenous, such 
as lagged dependent variables, then the test based on the simple regression is
not valid, although the test based on the GNR remains so. The presence of
the projection matrix $\mathbf{M}_X$ in the second factor in the denominator of (7.51)
means that this factor is always smaller than the corresponding factor in the
denominator of (7.52). If $\mathbf{X}$ is exogenous, this does not matter asymptotically,
as we have just seen. However, when $\mathbf{X}$ contains lagged dependent variables,
it turns out that the limits as $n \to \infty$ of $t_{\text{GNR}}$ and $t_{\text{SR}}$, under the null that
$\rho = 0$, are the same random variable, except for a deterministic factor that is
strictly greater for $t_{\text{GNR}}$ than for $t_{\text{SR}}$. Consequently, at least in large samples,
$t_{\text{SR}}$ rejects the null too infrequently. Readers are asked to investigate this
matter for a special case in Exercise 7.13. 


The Durbin-Watson Statistic 

The best-known test statistic for serial correlation is the d statistic proposed 
by Durbin and Watson (1950, 1951) and commonly referred to as the DW 
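statistic. Like the estimate $\tilde\rho$ defined in (7.47), the DW statistic is completely
determined by the least squares residuals of the model under test:

$$\begin{aligned}
d &= \frac{\sum_{t=2}^{n}(\tilde u_t - \tilde u_{t-1})^2}{\sum_{t=1}^{n}\tilde u_t^2} \\
&= \frac{n^{-1}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}} + n^{-1}\tilde{\mathbf{u}}_1^\top\tilde{\mathbf{u}}_1}{n^{-1}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}}} - \frac{n^{-1}\tilde u_1^2 + 2n^{-1}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}}_1}{n^{-1}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}}}.
\end{aligned} \qquad (7.53)$$

Here $\tilde{\mathbf{u}}_1$ (in bold) is the vector of lagged residuals, while $\tilde u_1$ is the first residual.
If we ignore the difference between $n^{-1}\tilde{\mathbf{u}}^\top\tilde{\mathbf{u}}$ and $n^{-1}\tilde{\mathbf{u}}_1^\top\tilde{\mathbf{u}}_1$, and the term
$n^{-1}\tilde u_1^2$, both of which clearly tend to zero as $n \to \infty$, it can be seen that the
first term in the second line of (7.53) tends to 2 and the second term tends
to $-2\tilde\rho$. Therefore, $d$ is asymptotically equal to $2 - 2\tilde\rho$. Thus, in samples of
reasonable size, a value of $d = 2$ corresponds to the absence of serial correlation
in the residuals, while values of $d$ less than 2 correspond to $\tilde\rho > 0$, and values
greater than 2 correspond to $\tilde\rho < 0$. Just like the t statistic $t_{\text{SR}}$ based on the
simple regression (7.46), and for essentially the same reason, the DW statistic
is not valid when there are lagged dependent variables among the regressors.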
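
The relationship between $d$ and $\tilde\rho$ can be made concrete with a few lines of code. In this sketch, the function names `durbin_watson` and `rho_tilde` and the short list of residuals are made up for illustration.

```python
def durbin_watson(resid):
    """DW statistic: sum of squared first differences of the residuals
    divided by the sum of squared residuals, as in the first line of (7.53)."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(r * r for r in resid)

def rho_tilde(resid):
    """Estimate (7.47) from regressing u_t on u_{t-1}, with u_0 set to 0.
    The t = 1 terms vanish, so both sums effectively start at t = 2."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(r * r for r in resid[:-1])
    return num / den

# Made-up residuals that drift slowly, suggesting positive serial correlation.
resid = [0.5, 0.8, 0.3, -0.2, -0.6, -0.1, 0.4, 0.7, 0.2, -0.4]
d = durbin_watson(resid)
approx = 2.0 - 2.0 * rho_tilde(resid)   # asymptotic approximation to d
```

In a sample this small, `d` and `approx` can differ noticeably, because the terms ignored in the asymptotic argument are not negligible.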


In Section 3.6, we saw that, for a correctly specified linear regression model, 
the residual vector $\tilde{\mathbf{u}}$ is equal to $\mathbf{M}_X\mathbf{u}$. Therefore, even if the error terms are
serially independent, the residuals will generally display a certain amount of 
serial correlation. This implies that the finite-sample distributions of all the 
test statistics we have discussed, including that of the DW statistic, depend 
on X. In practice, applied workers generally make use of the fact that the 
critical values for $d$ are known to fall between two bounding values, $d_L$ and
$d_U$, which depend only on the sample size, $n$, the number of regressors, $k$, and
whether or not there is a constant term. These bounding critical values have 
been tabulated for many values of n and k; see Savin and White (1977). 


The standard tables, which are deliberately not printed in this book, contain 
bounds for one-tailed DW tests of the null hypothesis that $\rho \le 0$ against
the alternative that $\rho > 0$. An investigator will reject the null hypothesis if
$d < d_L$, fail to reject if $d > d_U$, and come to no conclusion if $d_L < d < d_U$.
For example, for a test at the .05 level when $n = 100$ and $k = 8$, including the
constant term, the bounding critical values are $d_L = 1.528$ and $d_U = 1.826$.
Therefore, one would reject the null hypothesis if d < 1.528 and not reject it 
if d > 1.826. Notice that, even for this not particularly small sample size, the 
indeterminate region between 1.528 and 1.826 is quite large. 


It should by now be evident that the Durbin-Watson statistic, despite its 
popularity, is not very satisfactory. Using it with standard tables is relatively 
cumbersome and often yields inconclusive results. Moreover, the standard 
tables only allow us to perform one-tailed tests against the alternative that 
$\rho > 0$. Since the alternative that $\rho < 0$ is often of interest as well, the inability
to perform a two-tailed test, or a one-tailed test against this alternative, using 
standard tables is a serious limitation. Although exact P values for both one- 
tailed and two-tailed tests, which depend on the X matrix, can be obtained 
by using appropriate software, many computer programs do not offer this 
capability. In addition, the DW statistic is not valid when the regressors 
include lagged dependent variables, and it cannot easily be generalized to test 
for higher-order processes. Happily, the development of simulation-based tests 
has made the DW statistic obsolete. 


Monte Carlo Tests for Serial Correlation 


We discussed simulation-based tests, including Monte Carlo tests and boot- 
strap tests, at some length in Section 4.6. The techniques discussed there can 
readily be applied to the problem of testing for serial correlation in linear and 
nonlinear regression models. 


All the test statistics we have discussed, namely, $t_{\text{GNR}}$, $t_{\text{SR}}$, and $d$, are pivotal
under the null hypothesis that $\rho = 0$ when the assumptions of the classical
normal linear model are satisfied. This makes it possible to perform Monte
Carlo tests that are exact in finite samples. Pivotalness follows from two
properties shared by all these statistics. The first of these is that they depend
only on the residuals $\tilde u_t$ obtained by estimation under the null hypothesis.
The distribution of the residuals depends on the exogenous explanatory variables $\mathbf{X}$, but these are given and the same for all DGPs in a classical normal
linear model. The distribution does not depend on the parameter vector $\boldsymbol\beta$ of
the regression function, because, if $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$, then $\mathbf{M}_X\mathbf{y} = \mathbf{M}_X\mathbf{u}$ whatever the value of the vector $\boldsymbol\beta$.


The second property that all the statistics we have considered share is scale
invariance. By this, we mean that multiplying the dependent variable by
an arbitrary scalar $\lambda$ leaves the statistic unchanged. In a linear regression
model, multiplying the dependent variable by $\lambda$ causes the residuals to be
multiplied by $\lambda$. But the statistics defined in (7.51), (7.52), and (7.53) are
clearly unchanged if all the residuals are multiplied by the same constant, and
so these statistics are scale invariant. Since the residuals $\tilde u$ are equal to $M_X u$,
it follows that multiplying $\sigma$ by an arbitrary $\lambda$ multiplies the residuals by $\lambda$.
Consequently, the distributions of the statistics are independent of $\sigma^2$ as well
as of $\beta$. This implies that, for the classical normal linear model, all three
statistics are pivotal.


We now outline how to perform Monte Carlo tests for serial correlation in the
context of the classical normal linear model. Let us call the test statistic we
are using $\tau$ and its realized value $\hat\tau$. If we want to test for AR(1) errors, the
best choice for the statistic $\tau$ is the t statistic $t_{\rm GNR}$ from the GNR (7.43), but
it could also be the DW statistic, the t statistic $t_{\rm SR}$ from the simple regression
(7.46), or even $\tilde\rho$ itself. If we want to test for AR($p$) errors, the best choice
for $\tau$ would be the F statistic from the GNR (7.45), but it could also be the
F statistic from a regression of $\tilde u_t$ on $\tilde u_{t-1}$ through $\tilde u_{t-p}$.


The first step, evidently, is to compute $\hat\tau$. The next step is to generate $B$ sets
of simulated residuals and use each of them to compute a simulated test
statistic, say $\tau_j^*$, for $j = 1,\ldots,B$. Because the parameters do not matter,
we can simply draw $B$ vectors $u_j^*$ from the N(0, I) distribution and regress
each of them on $X$ to generate the simulated residuals $M_X u_j^*$, which are then
used to compute $\tau_j^*$. This can be done very inexpensively. The final step is to
calculate an estimated P value for whatever null hypothesis is of interest. For
example, for a two-tailed test of the null hypothesis that $\rho = 0$, the P value
would be the proportion of the $\tau_j^*$ that exceed $\hat\tau$ in absolute value:

$$\hat p^*(\hat\tau) = \frac{1}{B}\sum_{j=1}^{B} I\bigl(|\tau_j^*| > |\hat\tau|\bigr), \tag{7.54}$$

where $I(\cdot)$ denotes the indicator function. We would then reject the null
hypothesis at level $\alpha$ if $\hat p^*(\hat\tau) < \alpha$. As we saw in Section 4.6, such a
test will be exact whenever $B$ is chosen so that $\alpha(B+1)$ is an integer.
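
The three steps just described can be sketched in a few lines of code. The
following Python fragment is only an illustration under stated assumptions: it
uses NumPy, it takes as the statistic the t statistic from a regression of the
residuals on their own first lag (dropping the first observation rather than
padding it with a zero), and the function names `t_stat_sr` and
`monte_carlo_pvalue` are hypothetical names chosen for this sketch.

```python
import numpy as np

def t_stat_sr(resid):
    """t statistic from regressing u_t on u_{t-1} (no constant), t = 2..n."""
    u, u_lag = resid[1:], resid[:-1]
    rho = (u_lag @ u) / (u_lag @ u_lag)
    e = u - rho * u_lag
    s2 = (e @ e) / (len(u) - 1)
    return rho / np.sqrt(s2 / (u_lag @ u_lag))

def monte_carlo_pvalue(y, X, B=999, seed=0):
    """Two-tailed Monte Carlo P value in the spirit of (7.54)."""
    rng = np.random.default_rng(seed)
    n = len(y)

    def residuals(v):
        # M_X v, computed without forming the n x n matrix M_X
        coef, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ coef

    tau_hat = t_stat_sr(residuals(y))
    # Parameters do not matter under the null, so we draw u* ~ N(0, I)
    tau_star = np.array(
        [t_stat_sr(residuals(rng.standard_normal(n))) for _ in range(B)]
    )
    p_value = np.mean(np.abs(tau_star) > abs(tau_hat))
    return tau_hat, p_value
```

The default $B = 999$ is chosen so that $\alpha(B+1)$ is an integer for the
usual levels .01, .05, and .10.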


Bootstrap Tests for Serial Correlation 


Whenever the regression function is nonlinear or contains lagged dependent 
variables, or whenever the distribution of the error terms is unknown, none of 
the standard test statistics for serial correlation will be pivotal. Nevertheless, 
it is still possible to obtain very accurate inferences, even in quite small sam- 
ples, by using bootstrap tests. The procedure is essentially the one described 
in the previous subsection. We still generate B simulated test statistics and 
use them to compute a P value according to (7.54) or its analog for a one- 
tailed test. For best results, the test statistic used should be asymptotically 
valid for the model that is being tested. In particular, we should avoid $d$ and
$t_{\rm SR}$ whenever there are lagged dependent variables.


It is extremely important to generate the bootstrap samples in such a way that 
they are compatible with the model under test. Ways of generating bootstrap 
samples for regression models were discussed in Section 4.6. If the model
is nonlinear or includes lagged dependent variables, we need to generate $y_j^*$
rather than just $u_j^*$. For this, we need estimates of the parameters of the
regression function. If the model includes lagged dependent variables, we
must generate the bootstrap samples recursively, as in (4.66). Unless we are
going to assume that the error terms are normally distributed, we should
draw the bootstrap error terms from the EDF of the residuals for the model
under test, after they have been appropriately rescaled. Recall that there is
more than one way to do this. The simplest approach is just to multiply each
residual by $(n/(n-k))^{1/2}$, as in expression (4.68).
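
As an illustration of recursive bootstrap generation, the sketch below treats
the simplest dynamic model, $y_t = \beta_1 + \beta_2 y_{t-1} + u_t$, conditioning
on the first observed value. This particular model, the use of the t statistic
on the lagged residual in a GNR-type regression, and the function names are all
assumptions made for the example; the code uses NumPy and is not meant as a
definitive implementation.

```python
import numpy as np

def ols(y, X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

def t_gnr(y, X):
    """t statistic on the lagged residual when the OLS residuals are
    regressed on X and on the residuals lagged once."""
    _, u = ols(y, X)
    u_lag = np.concatenate(([0.0], u[:-1]))  # missing lag replaced by zero
    R = np.column_stack([X, u_lag])
    coef, e = ols(u, R)
    s2 = (e @ e) / (len(u) - R.shape[1])
    cov = s2 * np.linalg.inv(R.T @ R)
    return coef[-1] / np.sqrt(cov[-1, -1])

def bootstrap_pvalue(y, B=399, seed=0):
    """Bootstrap P value for y_t = b1 + b2*y_{t-1} + u_t (assumed model)."""
    rng = np.random.default_rng(seed)
    yy = y[1:]
    X = np.column_stack([np.ones(len(yy)), y[:-1]])
    n, k = X.shape
    beta, u = ols(yy, X)
    tau_hat = t_gnr(yy, X)
    u_rescaled = u * np.sqrt(n / (n - k))  # rescaling as in (4.68)
    tau_star = np.empty(B)
    for j in range(B):
        u_star = rng.choice(u_rescaled, size=n, replace=True)  # draw from EDF
        y_star = np.empty(n + 1)
        y_star[0] = y[0]
        for t in range(n):  # recursive generation of the bootstrap sample
            y_star[t + 1] = beta[0] + beta[1] * y_star[t] + u_star[t]
        X_star = np.column_stack([np.ones(n), y_star[:-1]])
        tau_star[j] = t_gnr(y_star[1:], X_star)
    return tau_hat, np.mean(np.abs(tau_star) > abs(tau_hat))
```

Because the regression includes a constant, the residuals already have mean
zero, so no recentering is needed before resampling.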


We strongly recommend the use of simulation-based tests for serial correla- 
tion, rather than asymptotic tests. Monte Carlo tests are appropriate only 
in the context of the classical normal linear model, but bootstrap tests are 
appropriate under much weaker assumptions. It is generally a good idea to 
test for both AR(1) errors and higher-order autoregressive errors, at least 
fourth-order in the case of quarterly data, and at least twelfth-order in the 
case of monthly data. 


Heteroskedasticity-Robust Tests 


The tests for serial correlation that we have discussed are based on the assump- 
tion that the error terms are homoskedastic. When this crucial assumption is 
violated, the asymptotic distributions of all the test statistics will differ from 
whatever distributions they are supposed to follow asymptotically. However, 
as we saw in Section 6.8, it is not difficult to modify GNR-based tests to make 
them robust to heteroskedasticity of unknown form. 


Suppose we wish to test the linear regression model (7.42), in which the error 
terms are serially uncorrelated, against the alternative that the error terms 
follow an AR(p) process. Under the assumption of homoskedasticity, we could 
simply run the GNR (7.45) and use an asymptotic F test. If we let $Z$ denote
an $n \times p$ matrix with typical element $Z_{ti} = \tilde u_{t-i}$, where any missing lagged
residuals are replaced by zeros, this GNR can be written as


$$\tilde u = Xb + Zc + \text{residuals}. \tag{7.55}$$


The ordinary F test for c = 0 in (7.55) is not robust to heteroskedasticity, but 
a heteroskedasticity-robust test can easily be computed using the procedure 
described in Section 6.8. This procedure works as follows: 


1. Create the matrices $\tilde U X$ and $\tilde U Z$ by multiplying the $t^{\rm th}$ row of $X$ and
the $t^{\rm th}$ row of $Z$ by $\tilde u_t$ for all $t$.


2. Create the matrices $\tilde U^{-1}X$ and $\tilde U^{-1}Z$ by dividing the $t^{\rm th}$ row of $X$ and
the $t^{\rm th}$ row of $Z$ by $\tilde u_t$ for all $t$.


3. Regress each of the columns of $\tilde U^{-1}X$ and $\tilde U^{-1}Z$ on $\tilde U X$ and $\tilde U Z$ jointly.
Save the resulting matrices of fitted values and call them $\bar X$ and $\bar Z$,
respectively.


4. Regress $\iota$, a vector of 1s, on $\bar X$. Retain the sum of squared residuals from
this regression, and call it RSSR. Then regress $\iota$ on $\bar X$ and $\bar Z$ jointly,
retain the sum of squared residuals, and call it USSR.


5. Compute the test statistic RSSR $-$ USSR, which will be asymptotically
distributed as $\chi^2(p)$ under the null hypothesis.
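
The five steps translate almost line by line into matrix code. The sketch
below is a direct transcription under a few assumptions: it uses NumPy, it
supposes that no residual is exactly zero (otherwise step 2 fails), and the
function name `hr_serial_corr_stat` is a hypothetical label for this example.

```python
import numpy as np

def fitted(Y, R):
    """Fitted values from regressing (each column of) Y on R."""
    coef, *_ = np.linalg.lstsq(R, Y, rcond=None)
    return R @ coef

def ssr(Y, R):
    """Sum of squared residuals from regressing Y on R."""
    return float(np.sum((Y - fitted(Y, R)) ** 2))

def hr_serial_corr_stat(y, X, p=1):
    """RSSR - USSR for a test against AR(p) errors, following steps 1-5."""
    u = y - fitted(y, X)                       # residuals under the null
    n = len(u)
    # Z holds residuals lagged 1..p, with missing lags replaced by zeros
    Z = np.column_stack(
        [np.concatenate((np.zeros(i), u[:-i])) for i in range(1, p + 1)]
    )
    UX, UZ = u[:, None] * X, u[:, None] * Z    # step 1
    UinvX, UinvZ = X / u[:, None], Z / u[:, None]  # step 2 (needs u_t != 0)
    R = np.column_stack([UX, UZ])
    X_bar, Z_bar = fitted(UinvX, R), fitted(UinvZ, R)  # step 3
    iota = np.ones(n)
    rssr = ssr(iota, X_bar)                    # step 4
    ussr = ssr(iota, np.column_stack([X_bar, Z_bar]))
    return rssr - ussr                         # step 5
```

The value returned would be compared with a critical value from the
$\chi^2(p)$ distribution. Since the second regression in step 4 nests the
first, the statistic cannot be negative apart from rounding error.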


Although this heteroskedasticity-robust test is asymptotically valid, it will 
not be exact in finite samples. In principle, it should be possible to obtain 
more reliable results by using bootstrap P values instead of asymptotic ones. 
However, none of the methods of generating bootstrap samples for regression 
models that we have discussed so far (see Section 4.6) is appropriate for a 
model with heteroskedastic error terms. Several methods exist, but they are 
beyond the scope of this book, and there currently exists no method that we 
can recommend with complete confidence; see Davison and Hinkley (1997) 
and Horowitz (2001). 


Other Tests Based on OLS Residuals 


The tests for serial correlation that we have discussed in this section are by 
no means the only scale-invariant tests based on least squares residuals that 
are regularly encountered in econometrics. Many tests for heteroskedasticity, 
skewness, kurtosis, and other deviations from the NID assumption also have 
these properties. For example, consider tests for heteroskedasticity based 
on regression (7.28). Nothing in that regression depends on y except for the 
squared residuals that constitute the regressand. Further, it is clear that both 
the F statistic for the hypothesis that $b = 0$ and $n$ times the centered $R^2$ are 
scale invariant. Therefore, for a classical normal linear model with X and Z 
fixed, these statistics are pivotal. Consequently, Monte Carlo tests based on 
them, in which we draw the error terms from the N(0,1) distribution, are 
exact in finite samples. 


When the normality assumption is not appropriate, we have two options. If 
some other distribution that is known up to a scale parameter is thought to be 
appropriate, we can draw the error terms from it instead of from the N(0, 1) 
distribution. If the assumed distribution really is the true one, we obtain 
an exact test. Alternatively, we can perform a bootstrap test in which the 
error terms are obtained by resampling the rescaled residuals. This is also 
appropriate when there are lagged dependent variables among the regressors. 
The bootstrap test will not be exact, but it should still perform well in finite 
samples no matter how the error terms actually happen to be distributed. 


7.8 Estimating Models with Autoregressive Errors 


If we decide that the error terms of a regression model are serially correlated, 
either on the basis of theoretical considerations or as a result of specification 
testing, and we are confident that the regression function itself is not
misspecified, the next step is to estimate a modified model which takes account of
the serial correlation. The simplest such model is (7.40), which is the original 
regression model modified by having the error terms follow an AR(1) process. 
For ease of reference, we rewrite (7.40) here: 


$$y_t = X_t\beta + u_t, \quad u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2). \tag{7.56}$$


In many cases, as we will discuss in the next section, the best approach may 
actually be to specify a more complicated, dynamic, model for which the 
error terms are not serially correlated. In this section, however, we ignore this 
important issue and simply discuss how to estimate the model (7.56) under 
various assumptions. 


Estimation by Feasible GLS 


We have seen that, if the $u_t$ follow a stationary AR(1) process, that is, if
$|\rho| < 1$ and $\mathrm{Var}(u_1) = \sigma_u^2 = \sigma_\varepsilon^2/(1-\rho^2)$, then the covariance matrix of
the entire vector $u$ is the $n \times n$ matrix $\Omega(\rho)$ given in (7.32). In order to
compute GLS estimates, we need to find a matrix $\Psi$ with the property that
$\Psi\Psi^\top = \Omega^{-1}$. This property will be satisfied whenever the covariance matrix
of $\Psi^\top u$ is proportional to the identity matrix, which it will be if we choose $\Psi$
in such a way that $\Psi^\top u = \varepsilon$.


For $t = 2,\ldots,n$, we know from (7.29) that

$$\varepsilon_t = u_t - \rho u_{t-1}, \tag{7.57}$$

and this allows us to construct the rows of $\Psi^\top$ except for the first row. The
$t^{\rm th}$ row must have 1 in the $t^{\rm th}$ position, $-\rho$ in the $(t-1)^{\rm st}$ position, and 0s
everywhere else.


For the first row of $\Psi^\top$, however, we need to be a little more careful. Under
the hypothesis of stationarity of $u_t$, the variance of $u_1$ is $\sigma_u^2$. Further, since
the $\varepsilon_t$ are innovations, $u_1$ is uncorrelated with the $\varepsilon_t$ for $t = 2,\ldots,n$. Thus,
if we define $\varepsilon_1$ by the formula


$$\varepsilon_1 = (\sigma_\varepsilon/\sigma_u)u_1 = (1-\rho^2)^{1/2}u_1, \tag{7.58}$$


it can be seen that the $n$-vector $\varepsilon$, with the first component $\varepsilon_1$ defined
by (7.58) and the remaining components $\varepsilon_t$ defined by (7.57), has a
covariance matrix equal to $\sigma_\varepsilon^2 I$.


Putting together (7.57) and (7.58), we conclude that $\Psi^\top$ should be defined
as an $n \times n$ matrix with all diagonal elements equal to 1 except for the first,
which is equal to $(1-\rho^2)^{1/2}$, and all other elements equal to 0 except for
the ones on the diagonal immediately below the principal diagonal, which are
equal to $-\rho$. In terms of $\Psi$ rather than of $\Psi^\top$, we have:


$$\Psi(\rho) = \begin{bmatrix}
(1-\rho^2)^{1/2} & -\rho & 0 & \cdots & 0 & 0 \\
0 & 1 & -\rho & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & -\rho \\
0 & 0 & 0 & \cdots & 0 & 1
\end{bmatrix}, \tag{7.59}$$


where the notation $\Psi(\rho)$ emphasizes that the matrix depends on the usually
unknown parameter $\rho$. The calculations needed to show that the matrix $\Psi\Psi^\top$
is proportional to the inverse of $\Omega$, as given by (7.32), are outlined in Exercises
7.9 and 7.10.


It is essential that the AR(1) parameter $\rho$ either be known or be consistently
estimable. If we know $\rho$, we can obtain GLS estimates. If we do not know it
but can estimate it consistently, we can obtain feasible GLS estimates. For the
case in which the explanatory variables are all exogenous, the simplest way
to estimate $\rho$ consistently is to use the estimator $\tilde\rho$ from regression (7.46),
defined in (7.47). Whatever estimate of $\rho$ is used must satisfy the stationarity
condition that $|\rho| < 1$, without which the process would not be stationary, and
the transformation for the first observation would involve taking the square
root of a negative number. Unfortunately, the estimator $\tilde\rho$ is not guaranteed
to satisfy the stationarity condition, although, in practice, it is very likely to
do so when the model is correctly specified, even if the true value of $\rho$ is quite
large in absolute value.


Whether $\rho$ is known or estimated, the next step in GLS estimation is to form
the vector $\Psi^\top y$ and the matrix $\Psi^\top X$. It is easy to do this without having to
store the $n \times n$ matrix $\Psi$ in computer memory. The first element of $\Psi^\top y$ is
$(1-\rho^2)^{1/2}y_1$, and the remaining elements have the form $y_t - \rho y_{t-1}$. Each
column of $\Psi^\top X$ has precisely the same form as $\Psi^\top y$ and can be calculated in
precisely the same way.


The final step is to run an OLS regression of $\Psi^\top y$ on $\Psi^\top X$. This regression
yields the (feasible) GLS estimates

$$\hat\beta_{\rm GLS} = (X^\top\Psi\Psi^\top X)^{-1}X^\top\Psi\Psi^\top y \tag{7.60}$$

along with the estimated covariance matrix

$$\widehat{\mathrm{Var}}(\hat\beta_{\rm GLS}) = s^2(X^\top\Psi\Psi^\top X)^{-1}, \tag{7.61}$$

where $s^2$ is the usual OLS estimate of the variance of the error terms. Of
course, the estimator (7.60) is formally identical to (7.04), since (7.60) is valid
for any $\Psi$ matrix.
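
The transformation and the subsequent OLS regression can be sketched as
follows. This is a minimal illustration in Python with NumPy, assuming that a
value of $\rho$ with $|\rho| < 1$ is supplied; the function names
`ar1_transform` and `gls_ar1` are hypothetical and introduced only for this
example.

```python
import numpy as np

def ar1_transform(v, rho):
    """Premultiply a vector or matrix by Psi-transpose, as described for
    (7.59), without ever forming the n x n matrix Psi."""
    v = np.asarray(v, dtype=float)
    out = np.empty_like(v)
    out[0] = np.sqrt(1.0 - rho**2) * v[0]   # first observation
    out[1:] = v[1:] - rho * v[:-1]          # observations 2 through n
    return out

def gls_ar1(y, X, rho):
    """(Feasible) GLS estimates and covariance matrix, as in (7.60)-(7.61)."""
    y_s, X_s = ar1_transform(y, rho), ar1_transform(X, rho)
    beta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    resid = y_s - X_s @ beta
    n, k = X.shape
    s2 = (resid @ resid) / (n - k)          # usual OLS variance estimate
    cov = s2 * np.linalg.inv(X_s.T @ X_s)
    return beta, cov
```

Only vectors of length $n$ and matrices of dimension $n \times k$ are stored,
which is the point of computing the transformed variables directly.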


Estimation by Nonlinear Least Squares 


If we ignore the first observation, then (7.56), the linear regression model 
with AR(1) errors, can be written as the nonlinear regression model (7.41). 
Since the model (7.41) is written in such a way that the error terms are inno- 
vations, NLS estimation is consistent whether the explanatory variables are 
exogenous or merely predetermined. NLS estimates can be obtained by any 
standard nonlinear minimization algorithm of the type that was discussed 
in Section 6.4, where the function to be minimized is $\mathrm{SSR}(\beta, \rho)$, the sum of 
squared residuals for observations 2 through n. Such procedures generally 
work well, and they can also be used for models with higher-order autoregres- 
sive errors; see Exercise 7.17. However, some care must be taken to ensure 
that the algorithm does not terminate at a local minimum which is not also 
the global minimum. There is a serious risk of this, especially for models with 
lagged dependent variables among the regressors.$^2$
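
One simple way to guard against stopping at a local minimum is to exploit the
fact that, for a given value of $\rho$, the model is linear in $\beta$. The sketch
below is a grid search over $\rho$ (in the style of the Hildreth-Lu procedure),
which is a substitute for, not an instance of, the general minimization
algorithms of Section 6.4. It assumes NumPy, and the function names and the
grid are illustrative choices.

```python
import numpy as np

def ssr_given_rho(y, X, rho):
    """Minimize SSR(beta, rho) over beta for fixed rho, using
    observations 2 through n only."""
    y_s = y[1:] - rho * y[:-1]
    X_s = X[1:] - rho * X[:-1]
    beta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    e = y_s - X_s @ beta
    return float(e @ e), beta

def nls_ar1_grid(y, X, grid=None):
    """Return (rho, beta, SSR) at the grid point with the smallest SSR."""
    if grid is None:
        grid = np.linspace(-0.99, 0.99, 199)   # step of 0.01, |rho| < 1
    results = [ssr_given_rho(y, X, r) for r in grid]
    best = int(np.argmin([s for s, _ in results]))
    return float(grid[best]), results[best][1], results[best][0]
```

Plotting the concentrated SSR against the grid also reveals whether there are
several local minima. The grid minimizer can then serve as a starting value
for a standard algorithm if more precision is wanted.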


Whether or not there are lagged dependent variables in $X_t$, a valid estimated
covariance matrix can always be obtained by running the GNR (6.67), which
corresponds to the model (7.41), with all variables evaluated at the NLS
estimates $\hat\beta$ and $\hat\rho$. This GNR is


$$y_t - \hat\rho y_{t-1} - X_t\hat\beta + \hat\rho X_{t-1}\hat\beta
= (X_t - \hat\rho X_{t-1})b + b_\rho(y_{t-1} - X_{t-1}\hat\beta) + \text{residual}. \tag{7.62}$$

Since the OLS estimates of $b$ and $b_\rho$ will be equal to zero, the sum of squared
residuals from regression (7.62) is simply $\mathrm{SSR}(\hat\beta, \hat\rho)$. Therefore, the estimated
covariance matrix $\widehat{\mathrm{Var}}(\hat\beta, \hat\rho)$ is


$$\frac{\mathrm{SSR}(\hat\beta, \hat\rho)}{n-k-2}
\begin{bmatrix}
(X - \hat\rho X_1)^\top(X - \hat\rho X_1) & (X - \hat\rho X_1)^\top\hat u_1 \\
\hat u_1^\top(X - \hat\rho X_1) & \hat u_1^\top\hat u_1
\end{bmatrix}^{-1}, \tag{7.63}$$


where the $n \times k$ matrix $X_1$ has typical row $X_{t-1}$, and the vector $\hat u_1$ has typical
element $y_{t-1} - X_{t-1}\hat\beta$. This is the estimated covariance matrix that a good
nonlinear regression package should print. The first factor in (7.63) is just
the NLS estimate of $\sigma_\varepsilon^2$. The SSR is divided by $n - k - 2$ because there are
$k+1$ parameters in the regression function, one of which is $\rho$, and we estimate
using only $n - 1$ observations.


It is instructive to compute the limit in probability of the matrix (7.63) when
$n \to \infty$ for the case in which all the explanatory variables in $X_t$ are exogenous.
The parameters are all estimated consistently by NLS, and so the estimates
converge to the true parameter values $\beta_0$, $\rho_0$, and $\sigma_\varepsilon^2$ as $n \to \infty$. In computing
the limit of the denominator of the simple estimator $\tilde\rho$ given by (7.47), we saw
that $n^{-1}\hat u_1^\top\hat u_1$ tends to $\sigma_\varepsilon^2/(1 - \rho_0^2)$. The limit of $n^{-1}(X - \hat\rho X_1)^\top\hat u_1$ is the
same as that of $n^{-1}(X - \rho_0 X_1)^\top\hat u_1$, by the consistency of $\hat\rho$. In addition, given
the exogeneity of $X$, and thus also of $X_1$, it follows at once from the law of
large numbers that $n^{-1}(X - \rho_0 X_1)^\top\hat u_1$ tends to zero. Thus, in this special
case, the asymptotic covariance matrix of $n^{1/2}(\hat\beta - \beta_0)$ and $n^{1/2}(\hat\rho - \rho_0)$ is


$$\sigma_\varepsilon^2 \left( \operatorname*{plim}_{n\to\infty}
\begin{bmatrix}
\frac{1}{n}(X - \rho_0 X_1)^\top(X - \rho_0 X_1) & 0 \\
0^\top & \sigma_\varepsilon^2/(1 - \rho_0^2)
\end{bmatrix} \right)^{-1}. \tag{7.64}$$


$^2$ See Dufour, Gaudry, and Liem (1980) and Betancourt and Kelejian (1981).


Because the two off-diagonal blocks are zero, this matrix is said to be block- 
diagonal. As can be verified immediately, the inverse of such a matrix is itself a 
block-diagonal matrix, of which each block is the inverse of the corresponding 
block of the original matrix. Thus the asymptotic covariance matrix (7.64) is 
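the limit as $n \to \infty$ of


$$\begin{bmatrix}
n\sigma_\varepsilon^2\bigl((X - \rho_0 X_1)^\top(X - \rho_0 X_1)\bigr)^{-1} & 0 \\
0^\top & 1 - \rho_0^2
\end{bmatrix}. \tag{7.65}$$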


The block-diagonality of (7.65), which holds only if everything in $X_t$ is
exogenous, implies that the covariance matrix of $\hat\beta$ can be estimated using the
GNR (7.62) without the regressor corresponding to $\rho$. The estimated
covariance matrix will just be (7.63) without its last row and column. It is easy to
see that $n$ times this matrix tends to the top left block of (7.65) as $n \to \infty$.


The lower right-hand element of the matrix (7.65) tells us that, when all the
regressors are exogenous, the asymptotic variance of $n^{1/2}(\hat\rho - \rho_0)$ is $1 - \rho_0^2$.
A sensible estimate of the variance is therefore $\widehat{\mathrm{Var}}(\hat\rho) = n^{-1}(1 - \hat\rho^2)$. It may
seem surprising that the variance of $\hat\rho$ does not depend on $\sigma_\varepsilon^2$. However, we saw
earlier that, with exogenous regressors, the consistent estimator $\tilde\rho$ of (7.47) is
scale invariant. The same is true, asymptotically, of the NLS estimator $\hat\rho$, and
so its asymptotic variance is independent of $\sigma_\varepsilon^2$.


Comparison of GLS and NLS 


The most obvious difference between estimation by GLS and estimation by 
NLS is the treatment of the first observation: GLS takes it into account, and 
NLS does not. This difference reflects the fact that the two procedures are 
estimating slightly different models. With NLS, all that is required is the 
stationarity condition that $|\rho| < 1$. With GLS, on the other hand, the error
process must actually be stationary. Recall that the stationarity condition is
necessary but not sufficient for stationarity of the process. A sufficient
condition requires, in addition, that $\mathrm{Var}(u_1) = \sigma_u^2 = \sigma_\varepsilon^2/(1 - \rho^2)$, the stationary
value of the variance. Thus, if we suspect that $\mathrm{Var}(u_1) \ne \sigma_u^2$, GLS estimation
is not appropriate, because the matrix (7.32) is not the covariance matrix of
the error terms.


The second major difference between estimation by GLS and estimation by
NLS is that the former method estimates $\beta$ conditional on $\rho$, while the latter
method estimates $\beta$ and $\rho$ jointly. Except in the unlikely case in which the
value of $\rho$ is known, the first step in GLS is to estimate $\rho$ consistently. If
the explanatory variables in the matrix $X$ are all exogenous, there are several
procedures that will deliver a consistent estimate of $\rho$. The weak point is
that the estimate is not unique, and in general it is not optimal. One possible 
solution to this difficulty is to iterate the feasible GLS procedure, as suggested 
at the end of Section 7.4, and we will consider this solution below. 


A more fundamental weakness of GLS arises whenever one or more of the
explanatory variables are lagged dependent variables, or, more generally,
predetermined but not exogenous variables. Even with a consistent estimator
of $\rho$, one of the conditions for the applicability of feasible GLS, condition
(7.23), does not hold when any elements of $X_t$ are not exogenous. It is not
simple to see directly just why this is so, but, in the next paragraph, we will 
obtain indirect evidence by showing that feasible GLS gives an invalid estima- 
tor of the covariance matrix. Fortunately, there is not much temptation to use 
GLS if the non-exogenous explanatory variables are lagged variables, because 
lagged variables are not observed for the first observation. In all events, the 
conclusion is simple: We should avoid GLS if the explanatory variables are 
not all exogenous. 


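The GLS covariance matrix estimator is (7.61), which is obtained by regressing
$\Psi^\top(\acute\rho)y$ on $\Psi^\top(\acute\rho)X$ for some consistent estimate $\acute\rho$. Since $\Psi^\top(\rho)u = \varepsilon$ by
construction, $s^2$ is an estimator of $\sigma_\varepsilon^2$. Moreover, the first observation has no
impact asymptotically. Therefore, the limit as $n \to \infty$ of $n$ times (7.61) is the
matrix

$$\sigma_\varepsilon^2 \operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}(X - \rho_0 X_1)^\top(X - \rho_0 X_1)\Bigr)^{-1}. \tag{7.66}$$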

In contrast, the NLS covariance matrix estimator is (7.63). With exogenous 
regressors, n times (7.63) tends to the same limit as (7.65), of which the top 
left block is just (7.66). But when the regressors are not all exogenous, the 
argument that the off-diagonal blocks of n times (7.63) tend to zero no longer 
works, and, in fact, the limits of these blocks are in general nonzero. When a 
matrix that is not block-diagonal is inverted, the top left block of the inverse 
is not the same as the inverse of the top left block of the original matrix; 
see Exercise 7.11. In fact, as readers are asked to show in Exercise 7.12, the 
top left block of the inverse is greater by a positive semidefinite matrix than 
the inverse of the top left block. Consequently, the GLS covariance matrix 
estimator underestimates the true covariance matrix asymptotically. 


NLS has only one major weak point, which is that it does not take account of 
the first observation. Of course, this is really an advantage if the error process 
satisfies the stationarity condition without actually being stationary, or if 
some of the explanatory variables are not exogenous. But with a stationary 
error process and exogenous regressors, we wish to retain the information in 
the first observation, because it appears that retaining the first observation 
can sometimes lead to a noticeable efficiency gain in finite samples. The
reason is that the transformation for observation 1 is quite different from the
transformation for all the other observations. In consequence, the transformed 
first observation may well be a high leverage point; see Section 2.6. This 
is particularly likely to happen if one or more of the regressors is strongly 
trending. If so, dropping the first observation can mean throwing away a lot 
of information. See Davidson and MacKinnon (1993, Section 10.6) for a much 
fuller discussion and references. 


Efficient Estimation by GLS or NLS 


When the error process is stationary and all the regressors are exogenous, it 
is possible to obtain an estimator with the best features of GLS and NLS by 
modifying NLS so that it makes use of the information in the first observation 
and therefore yields an efficient estimator. The first-order conditions (7.07) 
for GLS estimation of the model (7.56) can be written as 


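$$X^\top\Psi\Psi^\top(y - X\beta) = 0.$$


Using (7.59) for $\Psi$, we see that these conditions are


$$\sum_{t=2}^{n}(X_t - \rho X_{t-1})^\top\bigl(y_t - X_t\beta - \rho(y_{t-1} - X_{t-1}\beta)\bigr)
+ (1 - \rho^2)X_1^\top(y_1 - X_1\beta) = 0. \tag{7.67}$$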


With NLS estimation, the first-order conditions that define the NLS estimator
are the conditions that the regressors in the GNR (7.62) should be orthogonal
to the regressand:


$$\sum_{t=2}^{n}(X_t - \rho X_{t-1})^\top\bigl(y_t - X_t\beta - \rho(y_{t-1} - X_{t-1}\beta)\bigr) = 0, \text{ and}$$

$$\sum_{t=2}^{n}(y_{t-1} - X_{t-1}\beta)\bigl(y_t - X_t\beta - \rho(y_{t-1} - X_{t-1}\beta)\bigr) = 0. \tag{7.68}$$


For given $\beta$, the second of the NLS conditions can be solved for $\rho$. If we write
$u(\beta) = y - X\beta$, and $u_1(\beta) = Lu(\beta)$, where $L$ is the matrix lag operator
defined in (7.49), we see that


$$\rho(\beta) = \frac{u^\top(\beta)\,u_1(\beta)}{u_1^\top(\beta)\,u_1(\beta)}. \tag{7.69}$$


This formula is similar to the estimator (7.47), except that $\beta$ may take on
any value instead of just $\tilde\beta$.


In Section 7.4, we mentioned the possibility of using an iterated feasible GLS 
procedure. We can now see precisely how such a procedure would work for 
this model. In the first step, we obtain the OLS parameter vector $\tilde\beta$. In the
second step, the formula (7.69) is evaluated at $\beta = \tilde\beta$ to obtain $\tilde\rho$, a consistent
estimate of $\rho$. In the third step, we use (7.60) to obtain the feasible GLS
estimate $\hat\beta_{\rm F}$, thus solving the first-order conditions (7.67). At this point, we
go back to the second step and insert $\hat\beta_{\rm F}$ into (7.69) for an updated estimate
of $\rho$, which we subsequently use in (7.60) for the next estimate of $\beta$. The
iterative procedure may then be continued until convergence, assuming that
it does converge. If so, then the final estimates, which we will call $\hat\beta$ and $\hat\rho$,
must satisfy the two equations


$$\sum_{t=2}^{n}(X_t - \hat\rho X_{t-1})^\top\bigl(y_t - X_t\hat\beta - \hat\rho(y_{t-1} - X_{t-1}\hat\beta)\bigr)
+ (1 - \hat\rho^2)X_1^\top(y_1 - X_1\hat\beta) = 0, \text{ and}$$

$$\sum_{t=2}^{n}(y_{t-1} - X_{t-1}\hat\beta)\bigl(y_t - X_t\hat\beta - \hat\rho(y_{t-1} - X_{t-1}\hat\beta)\bigr) = 0. \tag{7.70}$$

These conditions are identical to conditions (7.68), except for the term in the 
first condition coming from the first observation. Thus we see that iterated 
feasible GLS, without the first observation, is identical to NLS. If the first 
observation is retained, then iterated feasible GLS improves on NLS by taking 
account of the first observation. 
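
The iteration just described, with the first observation retained, might be
coded along the following lines. This is a sketch only: it assumes NumPy and
exogenous regressors, the function names are hypothetical, and the stopping
rule (successive changes smaller than `tol`) is an illustrative choice. It
raises an error if an iterate of $\rho$ leaves the stationary region, since
nothing in the procedure rules that out.

```python
import numpy as np

def psi_t(v, rho):
    """Premultiply v by the transpose of Psi(rho) without forming Psi."""
    out = np.empty_like(v, dtype=float)
    out[0] = np.sqrt(1.0 - rho**2) * v[0]
    out[1:] = v[1:] - rho * v[:-1]
    return out

def rho_of_beta(y, X, beta):
    """Solve the second NLS condition for rho, given beta."""
    u = y - X @ beta
    return (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])

def iterated_fgls(y, X, tol=1e-12, max_iter=200):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # step 1: OLS
    rho = rho_of_beta(y, X, beta)                   # step 2: rho given beta
    for _ in range(max_iter):
        if abs(rho) >= 1.0:
            raise ValueError("estimate of rho violates |rho| < 1")
        # step 3: feasible GLS given the current rho, as in (7.60)
        beta_new, *_ = np.linalg.lstsq(psi_t(X, rho), psi_t(y, rho), rcond=None)
        rho_new = rho_of_beta(y, X, beta_new)       # back to step 2
        change = max(np.max(np.abs(beta_new - beta)), abs(rho_new - rho))
        beta, rho = beta_new, rho_new
        if change < tol:
            break
    return beta, rho
```

At convergence, the returned pair should satisfy both of the estimating
equations given above, up to numerical error.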


We can also modify NLS to take account of the first observation. To do this, 
we extend the GNR (6.67), which is given by (7.62) when evaluated at $\hat\beta$
and $\hat\rho$, by giving it a first observation. For this observation, the regressand
is $(1 - \rho^2)^{1/2}(y_1 - X_1\beta)$, the regressors corresponding to $\beta$ are given by the
row vector $(1 - \rho^2)^{1/2}X_1$, and the regressor corresponding to $\rho$ is zero. The
conditions that the extended regressand should be orthogonal to the extended 
regressors are exactly the conditions (7.70). 


Two asymptotically equivalent procedures can be based on this extended 
GNR. Both begin by obtaining the NLS estimates of $\beta$ and $\rho$ without the 
first observation and evaluating the extended GNR at those preliminary NLS 
estimates. The OLS estimates from the extended GNR can be thought of as 
a vector of corrections to the initial estimates. For the first procedure, the 
final estimator is a one-step estimator, defined as in (6.59) by adding the cor- 
rections to the preliminary estimates. For the second procedure, this process 
is iterated. The variables of the extended GNR are evaluated at the one-step 
estimates, another set of corrections is obtained, these are added to the pre- 
vious estimates, and iteration continues until the corrections are negligible. If 
this happens, the iterated estimates once more satisfy the conditions (7.70), 
and so they are equal to the iterated GLS estimates. 


Although the iterated feasible GLS estimator generally performs well, it does 
have one weakness: There is no way to ensure that $|\hat\rho| < 1$. In the unlikely
but not impossible event that $|\hat\rho| > 1$, the estimated covariance matrix (7.61)
will not be valid, the second term in (7.67) will be negative, and the first
observation will therefore tend to have a perverse effect on the estimates of $\beta$.
In Chapter 10, we will see that maximum likelihood estimation shares the
good properties of iterated feasible GLS while also ensuring that the estimate
of $\rho$ satisfies the stationarity condition.


The iterated feasible GLS procedure considered above has much in common 
with a very old, but still widely-used, algorithm for estimating models with 
stationary AR(1) errors. This algorithm, which is called iterated Cochrane- 
Orcutt, was originally proposed in a classic paper by Cochrane and Orcutt 
(1949). It works in exactly the same way as iterated feasible GLS, except that 
it omits the first observation. The properties of this algorithm are explored 
in Exercises 7.18-19. 


7.9 Specification Testing and Serial Correlation 


Models estimated using time-series data frequently appear to have error terms 
which are serially correlated. However, as we will see, many types of misspec- 
ification can create the appearance of serial correlation. Therefore, finding 
evidence of serial correlation does not mean that it is necessarily appropriate 
to model the error terms as following some sort of autoregressive or moving 
average process. If the regression function of the original model is misspecified 
in any way, then a model like (7.41), which has been modified to incorporate 
AR(1) errors, will probably also be misspecified. It is therefore extremely 
important to test the specification of any regression model that has been 
“corrected” for serial correlation. 


The Appearance of Serial Correlation 


There are several types of misspecification of the regression function that can 
incorrectly create the appearance of serial correlation. For instance, it may be 
that the true regression function is nonlinear in one or more of the regressors 
while the estimated one is linear. In that case, depending on how the data 
are ordered, the residuals from a linear regression model may well appear to 
be serially correlated. All that is needed is for the independent variables on 
which the dependent variable depends nonlinearly to be correlated with time. 


As a concrete example, consider Figure 7.1, which shows 200 hypothetical 
observations on a regressor x and a regressand y, together with an OLS re- 
gression line and the fitted values from the true, nonlinear model. For the 
linear model, the residuals are always negative for the smallest and largest 
values of x, and they tend to be positive for the intermediate values. As a 
consequence, they appear to be serially correlated: If the observations are 
ordered according to the value of $x$, the estimate $\tilde\rho$ obtained by regressing the
OLS residuals on themselves lagged once is 0.298, and the t statistic for $\rho = 0$
is 4.462. Thus, if the data are ordered in this way, there appears to be strong
evidence of serial correlation. But this evidence is misleading. Either plotting
the residuals against $x$ or including $x^2$ as an additional regressor will quickly
reveal the true nature of the misspecification. 
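
The phenomenon is easy to reproduce by simulation. The sketch below uses
artificial data that are entirely hypothetical (they are not the data behind
the figures quoted above): the true regression function is quadratic in $x$,
the errors are independent, and the observations are ordered by $x$. The code
assumes NumPy, and all names and parameter values are illustrative.

```python
import numpy as np

def rho_and_t(resid):
    """Estimate of rho and its t statistic from regressing the residuals
    on themselves lagged once (first observation dropped)."""
    u, u_lag = resid[1:], resid[:-1]
    rho = (u_lag @ u) / (u_lag @ u_lag)
    e = u - rho * u_lag
    se = np.sqrt((e @ e) / (len(u) - 1) / (u_lag @ u_lag))
    return rho, rho / se

def ols_resid(y, X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(42)
n = 200
x = np.sort(rng.uniform(0.0, 10.0, n))     # observations ordered by x
y = 1.0 + 2.0 * x - 0.2 * x**2 + rng.standard_normal(n)  # independent errors

ones = np.ones(n)
rho_lin, t_lin = rho_and_t(ols_resid(y, np.column_stack([ones, x])))
rho_quad, t_quad = rho_and_t(ols_resid(y, np.column_stack([ones, x, x**2])))

print(f"linear model:    rho = {rho_lin:.3f}, t = {t_lin:.3f}")
print(f"quadratic model: rho = {rho_quad:.3f}, t = {t_quad:.3f}")
```

With the squared term omitted, the residuals look strongly serially
correlated; once it is included, the apparent serial correlation should
largely disappear.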


Figure 7.1 The appearance of serial correlation (the plot shows the regression
line for the linear model and the fitted values for the true model)


The true regression function in this example contains a term in $x^2$. Since
the linear model omits this term, it is underspecified, in the sense discussed
in Section 3.7. Any sort of underspecification has the potential to create 
the appearance of serial correlation if the incorrectly omitted variables are 
themselves serially correlated. Therefore, whenever we find evidence of serial 
correlation, our first reaction should be to think carefully about the specifica- 
tion of the regression function. Perhaps one or more additional independent 
variables should be included among the regressors. Perhaps powers, cross- 
products, or lags of some of the existing independent variables need to be 
included. Or perhaps the regression function should be made dynamic by 
including one or more lags of the dependent variable. 


Common Factor Restrictions 


It is very common for linear regression models to suffer from dynamic mis- 
specification. The simplest example is failing to include a lagged dependent 
variable among the regressors. More generally, dynamic misspecification oc- 
curs whenever the regression function incorrectly omits lags of the dependent 
variable or of one or more independent variables. A somewhat mechanical, 
but often very effective, way to detect dynamic misspecification in models 
with autoregressive errors is to test the common factor restrictions that are 
implicit in such models. The idea of testing these restrictions was initially pro- 
posed by Sargan (1964) and further developed by Hendry and Mizon (1978), 
Mizon and Hendry (1980), Sargan (1980), and others. See Hendry (1995) for 
a detailed treatment of dynamic specification in linear regression models.


The easiest way to understand what common factor restrictions are and how
they got their name is to consider a linear regression model with errors that 
apparently follow an AR(1) process. In this case, there are really three nested 
models. The first of these is the original linear regression model with error 
terms that are assumed to be serially independent: 


$$H_0:\quad y_t = X_t\beta + u_t, \qquad u_t \sim \mathrm{IID}(0, \sigma^2). \tag{7.71}$$


The second is the nonlinear model (7.41) that is obtained when the error 
terms in (7.71) follow the AR(1) process (7.29). Although we have already 
discussed this model extensively, we rewrite it here for convenience: 


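$$H_1:\quad y_t = \rho y_{t-1} + X_t\beta - \rho X_{t-1}\beta + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2). \tag{7.72}$$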


The third is the linear model that can be obtained by relaxing the nonlinear 
restrictions which are implicit in (7.72). This model is 


$$H_2:\quad y_t = \rho y_{t-1} + X_t\beta + X_{t-1}\gamma + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2), \tag{7.73}$$


where $\gamma$, like $\beta$, is a k-vector. When all three of these models are estimated
over the same sample period, the original model, $H_0$, is a special case of the
nonlinear model $H_1$, which in turn is a special case of the unrestricted linear
model $H_2$. Of course, in order to estimate $H_1$ and $H_2$, we need to drop the
first observation.


The nonlinear model $H_1$ imposes on $H_2$ the restrictions that $\gamma = -\rho\beta$. The
reason for calling these restrictions “common factor” restrictions can easily be
seen if we rewrite both models using lag operator notation (see Section 7.6).
When we do this, $H_1$ becomes


$$(1 - \rho L)y_t = (1 - \rho L)X_t\beta + \varepsilon_t, \tag{7.74}$$


and $H_2$ becomes


$$(1 - \rho L)y_t = X_t\beta + LX_t\gamma + \varepsilon_t. \tag{7.75}$$


It is evident that in (7.74), but not in (7.75), the common factor $1 - \rho L$
appears on both sides of the equation. This is where the term “common
factor restrictions” comes from.
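The restriction linking the two models can be checked numerically. The sketch below simulates data from a model with AR(1) errors, estimates the unrestricted linear model by OLS, and compares the coefficient on the lagged regressor with minus the product of the other two estimates. The single regressor without a constant, the parameter values, and the sample size are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, rho = 20000, 2.0, 0.6

x = rng.standard_normal(n)            # one exogenous regressor, no constant
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):                 # AR(1) errors: u_t = rho * u_{t-1} + eps_t
    u[t] = rho * u[t - 1] + eps[t]
y = beta * x + u                      # linear model with AR(1) errors

# Unrestricted model: regress y_t on y_{t-1}, x_t and x_{t-1}, dropping t = 1.
Z = np.column_stack([y[:-1], x[1:], x[:-1]])
rho_hat, beta_hat, gamma_hat = np.linalg.lstsq(Z, y[1:], rcond=None)[0]

print(f"gamma_hat           = {gamma_hat:.3f}")
print(f"-rho_hat * beta_hat = {-rho_hat * beta_hat:.3f}")
```

Because the data satisfy the common factor restriction by construction, the two printed numbers should be close in a large sample; a marked discrepancy on real data would be evidence against the restriction.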


How Many Common Factor Restrictions Are There? 


There is one feature of common factor restrictions that can be tricky: It is
often not obvious just how many restrictions there are. For the case of testing
$H_1$ against $H_2$, there appear to be k restrictions. The null hypothesis, $H_1$,
has k + 1 parameters (the k-vector $\beta$ and the scalar $\rho$), and the alternative
hypothesis, $H_2$, seems to have 2k + 1 parameters (the k-vectors $\beta$ and $\gamma$,
and the scalar $\rho$). Therefore, the number of restrictions appears to be the
difference between 2k + 1 and k + 1, which is k. In fact, however, the number
of restrictions will almost always be less than k, because, except in rare cases,
the number of identifiable parameters in $H_2$ will be less than 2k + 1. We now
show why this is the case.


Let us consider a simple example. Suppose the regression function for the
original model $H_0$ is


$$\beta_1 + \beta_2 z_t + \beta_3 t + \beta_4 z_{t-1} + \beta_5 y_{t-1}, \tag{7.76}$$


where $z_t$ is the $t^{\rm th}$ observation on some independent variable, and t is the $t^{\rm th}$
observation on a linear time trend. The regression function for the unrestricted
model $H_2$ that corresponds to (7.76) is


$$\beta_1 + \beta_2 z_t + \beta_3 t + \beta_4 z_{t-1} + \beta_5 y_{t-1} + \rho y_{t-1} + \gamma_1 + \gamma_2 z_{t-1} + \gamma_3 (t - 1) + \gamma_4 z_{t-2} + \gamma_5 y_{t-2}. \tag{7.77}$$

At first glance, this regression function appears to have 11 parameters. However,
it really has only 7, because 4 of them are unidentifiable. We cannot
estimate both $\beta_1$ and $\gamma_1$, because there cannot be two constant terms. Likewise,
we cannot estimate both $\beta_4$ and $\gamma_2$, because there cannot be two coefficients
of $z_{t-1}$, and we cannot estimate both $\beta_5$ and $\rho$, because there cannot
be two coefficients of $y_{t-1}$. We also cannot estimate $\gamma_3$ along with $\beta_3$ and
the constant, because $t$, $t - 1$, and the constant term are perfectly collinear,
since $t - (t - 1) = 1$. The version of $H_2$ that can actually be estimated has
regression function


$$\delta_1 + \beta_2 z_t + \delta_2 t + \delta_3 z_{t-1} + \delta_4 y_{t-1} + \gamma_4 z_{t-2} + \gamma_5 y_{t-2}, \tag{7.78}$$


where


$$\delta_1 \equiv \beta_1 + \gamma_1 - \gamma_3, \quad \delta_2 \equiv \beta_3 + \gamma_3, \quad \delta_3 \equiv \beta_4 + \gamma_2, \quad\text{and}\quad \delta_4 \equiv \rho + \beta_5.$$


We see that (7.78) has only 7 identifiable parameters: $\beta_2$, $\gamma_4$, $\gamma_5$, $\delta_1$, $\delta_2$,
$\delta_3$, and $\delta_4$, instead of the 11 parameters, many of them not identifiable, of
expression (7.77). In contrast, the regression function for the restricted model,
$H_1$, has 6 parameters: $\beta_1$ through $\beta_5$, and $\rho$. Therefore, in this example, $H_1$
imposes just one restriction on $H_2$.
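The count of identifiable parameters in this example can be verified mechanically: the number of parameters that can be estimated equals the column rank of the matrix that holds one regressor per coefficient in (7.77). The sketch below builds that matrix from arbitrary simulated series (the data, the sample size, and the column labels are assumptions; any generic series give the same rank).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
z = rng.standard_normal(n)
y = rng.standard_normal(n)          # any generic series will do for counting ranks
trend = np.arange(1.0, n + 1)

# Keep observations t = 3, ..., n so that two lags are available.
cur, lag1, lag2 = slice(2, n), slice(1, n - 1), slice(0, n - 2)
one = np.ones(n - 2)

columns = {
    "beta1: constant":         one,
    "beta2: z_t":              z[cur],
    "beta3: t":                trend[cur],
    "beta4: z_{t-1}":          z[lag1],
    "beta5: y_{t-1}":          y[lag1],
    "rho: y_{t-1}":            y[lag1],
    "gamma1: lagged constant": one,
    "gamma2: z_{t-1}":         z[lag1],
    "gamma3: t - 1":           trend[lag1],
    "gamma4: z_{t-2}":         z[lag2],
    "gamma5: y_{t-2}":         y[lag2],
}
W = np.column_stack(list(columns.values()))

n_params = W.shape[1]                       # apparent number of parameters
n_identified = np.linalg.matrix_rank(W)     # number that can actually be estimated
n_restricted = 6                            # beta1..beta5 and rho in the restricted model
print(n_params, n_identified, n_identified - n_restricted)
```

The last printed number is the number of restrictions, obtained as the rank of the unrestricted regressor matrix minus the number of parameters of the restricted model.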


The phenomenon illustrated in this example arises, to a greater or lesser
extent, for almost every model with common factor restrictions. Constant
terms, many types of dummy variables (notably, seasonal dummies and time
trends), lagged dependent variables, and independent variables that appear
with more than one time subscript always lead to an unrestricted model $H_2$
with some parameters that cannot be identified. The number of identifiable
parameters will almost always be less than 2k + 1, and, in consequence, the
number of restrictions will almost always be less than k.


Testing Common Factor Restrictions


Any of the techniques discussed in Sections 6.7 and 6.8 can be used to test
common factor restrictions. In practice, if the error terms are believed to be
homoskedastic, the easiest approach is probably to use an asymptotic F test.
For the example of equations (7.72) and (7.73), the restricted sum of squared
residuals, RSSR, is obtained from NLS estimation of $H_1$, and the unrestricted
one, USSR, is obtained from OLS estimation of $H_2$. Then the test statistic is


$$\frac{(\mathrm{RSSR} - \mathrm{USSR})/r}{\mathrm{USSR}/(n - k - r - 2)} \;\stackrel{a}{\sim}\; F(r,\, n - k - r - 2), \tag{7.79}$$


where r is the number of restrictions. The number of degrees of freedom in
the denominator reflects the fact that the unrestricted model has $k + r + 1$
parameters and is estimated using the $n - 1$ observations for $t = 2, \ldots, n$.
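A minimal sketch of the computation follows, for a hypothetical model with a constant and one regressor (so that k = 2 and, because the lagged constant is not identified, r = 1). The grid search over the autoregressive parameter is a crude stand-in for proper NLS estimation and is an assumption of this sketch, as are the simulated data and the function names.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

def rssr_h1(y, X, grid=np.linspace(-0.99, 0.99, 397)):
    """Approximate NLS for the restricted model: for each rho on the grid,
    beta is obtained by OLS on quasi-differenced data; keep the smallest SSR."""
    return min(ssr(X[1:] - rho * X[:-1], y[1:] - rho * y[:-1]) for rho in grid)

def common_factor_f(rssr, ussr, n, k, r):
    """Statistic (7.79); n counts all observations, including the dropped first one."""
    return ((rssr - ussr) / r) / (ussr / (n - k - r - 2))

# Simulated data with AR(1) errors (all values hypothetical).
rng = np.random.default_rng(3)
n = 120
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + eps[t]
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
# Identifiable regressors of the unrestricted model: constant, x_t, x_{t-1}, y_{t-1}.
Z = np.column_stack([np.ones(n - 1), x[1:], x[:-1], y[:-1]])
r = Z.shape[1] - (k + 1)

ussr = ssr(Z, y[1:])
rssr = rssr_h1(y, X)
f_stat = common_factor_f(rssr, ussr, n, k, r)
print(f"r = {r}, F = {f_stat:.3f}")
```

For any fixed value of the autoregressive parameter, the restricted regression is a special case of the unrestricted one, so the statistic computed this way cannot be negative.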


Of course, since both the null and alternative models involve lagged dependent
variables, the test statistic (7.79) does not actually follow the $F(r, n - k - r - 2)$
distribution in finite samples. Therefore, when the sample size is not large,
it is a good idea to bootstrap the test. As Davidson and MacKinnon (1999a)
have shown, highly reliable P values may be obtained in this way, even for
very small sample sizes. The bootstrap samples are generated recursively from
the restricted model, $H_1$, using the NLS estimates of that model. As with
bootstrap tests for serial correlation, the bootstrap error terms may either be
drawn from the normal distribution or obtained by resampling the rescaled
NLS residuals; see the discussion in Sections 4.6 and 7.7.
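The recursive generation of a bootstrap sample can be sketched as follows. The function name, the rescaling factor, and the choice to condition on the first observed value of the dependent variable are assumptions of this sketch rather than details given above; the parameter values passed in the usage lines are placeholders standing in for NLS estimates.

```python
import numpy as np

def bootstrap_sample_h1(y, X, beta, rho, resid, rng):
    """Generate one bootstrap sample recursively from the restricted model (7.72).

    resid holds the n - 1 residuals from estimating that model. They are
    rescaled by a degrees-of-freedom factor (an assumed convention) and
    resampled with replacement. The first observation is kept at y[0].
    """
    n = len(y)
    m = len(resid)
    p = X.shape[1] + 1                       # k elements of beta plus rho
    scaled = resid * np.sqrt(m / (m - p))
    eps_star = rng.choice(scaled, size=n - 1, replace=True)
    y_star = np.empty(n)
    y_star[0] = y[0]
    for t in range(1, n):
        y_star[t] = (rho * y_star[t - 1]
                     + (X[t] - rho * X[t - 1]) @ beta
                     + eps_star[t - 1])
    return y_star

# Usage with placeholder inputs.
rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)                       # placeholder data
beta_hat, rho_hat = np.array([1.0, 2.0]), 0.4    # stand-ins for NLS estimates
resid = rng.standard_normal(n - 1)               # stand-ins for NLS residuals
y_star = bootstrap_sample_h1(y, X, beta_hat, rho_hat, resid, rng)
```

Each bootstrap sample would then be used to recompute the test statistic, and the bootstrap P value is the proportion of bootstrap statistics that exceed the one computed from the actual data.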


Although this bootstrap procedure is conceptually simple, it may be quite
expensive to compute, because the nonlinear model (7.72) must be estimated
for every bootstrap sample. It may therefore be more attractive to follow the
idea in Exercises 6.17 and 6.18 by bootstrapping a GNR-based test statistic
that requires no nonlinear estimation at all. For the $H_1$ model (7.72), the
corresponding GNR is (7.62), but now we wish to evaluate it, not at the NLS
estimates from (7.72), but at the estimates $\acute\beta$ and $\acute\rho$ obtained by estimating
the linear $H_2$ model (7.73). These estimates are root-n consistent under $H_2$,
and so also under $H_1$, which is contained in $H_2$ as a special case. Thus the
GNR for $H_1$, which was introduced in Section 6.6, is


$$y_t - \acute\rho y_{t-1} - X_t\acute\beta + \acute\rho X_{t-1}\acute\beta = (X_t - \acute\rho X_{t-1})b + b_\rho(y_{t-1} - X_{t-1}\acute\beta) + \text{residual}. \tag{7.80}$$


Since $H_2$ is a linear model, the regressors of the GNR that corresponds to it
are just the regressors in (7.73), and the regressand is the same as in (7.80);
recall Section 6.5. However, in order to construct the GNR-based F statistic,
which has exactly the same form as (7.79), it is not necessary to run the
GNR for model $H_2$ at all. Since the regressand of (7.80) is just the dependent
variable of (7.73) plus a linear combination of the independent variables, the
residuals from (7.73) are the same as those from its GNR. Consequently, we
can evaluate (7.79) with USSR from (7.73) and RSSR from (7.80).
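The GNR-based statistic needs only two OLS regressions. The sketch below uses a single regressor and no constant, so that the OLS estimates from the unrestricted model can be used directly as the evaluation point and k = r = 1; the simulated data and variable names are assumptions.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and sum of squared residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return b, e @ e

rng = np.random.default_rng(4)
n, beta, rho = 200, 1.0, 0.5
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
y = beta * x + u

y_now, y_lag, x_now, x_lag = y[1:], y[:-1], x[1:], x[:-1]

# Step 1: OLS on the unrestricted model (7.73) gives USSR and the evaluation point.
(rho_a, beta_a, gamma_a), ussr = ols(np.column_stack([y_lag, x_now, x_lag]), y_now)

# Step 2: the GNR (7.80) for the restricted model, evaluated at (beta_a, rho_a).
regressand = y_now - rho_a * y_lag - beta_a * x_now + rho_a * beta_a * x_lag
regressors = np.column_stack([x_now - rho_a * x_lag, y_lag - beta_a * x_lag])
_, rssr = ols(regressors, regressand)

k, r = 1, 1
f_gnr = ((rssr - ussr) / r) / (ussr / (n - k - r - 2))
print(f"USSR = {ussr:.3f}, RSSR = {rssr:.3f}, F = {f_gnr:.3f}")
```

Since no nonlinear estimation is involved, repeating these two regressions on each bootstrap sample is cheap.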


In Section 6.6, we gave the impression that $\acute\beta$ and $\acute\rho$ are simply the OLS
estimates of $\beta$ and $\rho$ from (7.73). When X contains neither lagged dependent
variables nor multiple lags of any independent variable, this is true. However,
when these conditions are not satisfied, the parameters of (7.73) do not
correspond directly to those of (7.72), and this makes it a little more complicated
to obtain consistent estimates of these parameters. Just how to do so
was discussed in Section 10.3 of Davidson and MacKinnon (1993) and will be
illustrated in Exercise 7.16.


Tests of Nested Hypotheses 


The models $H_0$, $H_1$, and $H_2$ defined in (7.71) through (7.73) form a sequence
of nested hypotheses. Such sequences occur quite frequently in many branches
of econometrics, and they have an interesting property. Asymptotically, the F
statistic for testing $H_0$ against $H_1$ is independent of the F statistic for testing
$H_1$ against $H_2$. This is true whether we actually estimate $H_1$ or merely use
a GNR, and it is also true for other test statistics that are asymptotically
equivalent to F statistics. In fact, the result is true for any sequence of nested
hypotheses where the test statistics follow $\chi^2$ distributions asymptotically; see
Davidson and MacKinnon (1993, Supplement) and Exercise 7.21.


The independence property of tests in a nested sequence has a useful implication.
Suppose that $\tau_{ij}$ denotes the statistic for testing $H_i$, which has $k_i$
parameters, against $H_j$, which has $k_j > k_i$ parameters, where $i = 0, 1$ and
$j = 1, 2$, with $j > i$. Then, if each of the test statistics is asymptotically
distributed as $\chi^2(k_j - k_i)$,


$$\tau_{02} \stackrel{a}{=} \tau_{01} + \tau_{12}. \tag{7.81}$$


This result implies that, at least asymptotically, each of the component test
statistics is bounded above by the test statistic for $H_0$ against $H_2$.
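For nested linear regressions, the additivity in (7.81) can be made exact by scaling every difference in sums of squared residuals by the same variance estimate. The sketch below does this with the variance estimate from the largest model; that choice, the three simulated models, and the variable names are assumptions made for illustration.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(7)
n = 300
X0 = np.column_stack([np.ones(n), rng.standard_normal(n)])   # smallest model
X1 = np.column_stack([X0, rng.standard_normal(n)])           # adds one regressor
X2 = np.column_stack([X1, rng.standard_normal((n, 2))])      # adds two more
y = X0 @ np.array([1.0, 0.5]) + rng.standard_normal(n)       # generated under the smallest model

ssr0, ssr1, ssr2 = ssr(X0, y), ssr(X1, y), ssr(X2, y)
s2 = ssr2 / (n - X2.shape[1])     # common variance estimate from the largest model

tau01 = (ssr0 - ssr1) / s2
tau12 = (ssr1 - ssr2) / s2
tau02 = (ssr0 - ssr2) / s2
print(f"tau01 = {tau01:.3f}, tau12 = {tau12:.3f}, tau02 = {tau02:.3f}")
```

With a common denominator the three statistics add up exactly, so neither component can exceed the statistic for the smallest model against the largest.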


The result (7.81) is not particularly useful in the case of (7.71), (7.72), and
(7.73), where all of the test statistics are quite easy to compute. However, it
can sometimes come in handy. Suppose, for example, that it is easy to test
$H_0$ against $H_2$ but hard to test $H_0$ against $H_1$. Then, if $\tau_{02}$ is small enough
that it would not cause us to reject $H_0$ against $H_1$ when compared with the
appropriate critical value for the $\chi^2(k_1 - k_0)$ distribution, we do not need to
bother calculating $\tau_{01}$, because it will be even smaller.


7.10 Models for Panel Data


Many data sets are measured across two dimensions. One dimension is time, 
and the other is usually called the cross-section dimension. For example, we 
may have 40 annual observations on 25 countries, or 100 quarterly observations 
on 50 states, or 6 annual observations on 3100 individuals. Data of this type 
are often referred to as panel data. It is likely that the error terms for a model 
using panel data will display certain types of dependence, which should be 
taken into account when we estimate such a model. 


For simplicity, we restrict our attention to the linear regression model


$$y_{it} = X_{it}\beta + u_{it}, \qquad i = 1, \ldots, m, \quad t = 1, \ldots, T, \tag{7.82}$$


where $X_{it}$ is a 1 × k vector of observations on explanatory variables. There
are assumed to be m cross-sectional units and T time periods, for a total
of n = mT observations. If each $u_{it}$ has expectation zero conditional on its
corresponding $X_{it}$, we can estimate equation (7.82) by ordinary least squares.
But the OLS estimator is not efficient if the $u_{it}$ are not IID, and the IID
assumption is rarely realistic with panel data.


If certain shocks affect the same cross-sectional unit at all points in time,
the error terms $u_{it}$ and $u_{is}$ will be correlated for all $t \neq s$. Similarly, if
certain shocks affect all cross-sectional units at the same point in time, the
error terms $u_{it}$ and $u_{jt}$ will be correlated for all $i \neq j$. In consequence, if
we use OLS, not only will we obtain inefficient parameter estimates, but we
will also obtain an inconsistent estimate of their covariance matrix; recall
the discussion of Section 5.5. If the expectation of $u_{it}$ conditional on $X_{it}$ is
not zero, then, for reasons mentioned in Section 7.4, OLS will actually yield
inconsistent parameter estimates. This will happen, for example, when $X_{it}$
contains lagged dependent variables and the $u_{it}$ are serially correlated.


Error-Components Models 


The two most popular approaches for dealing with panel data are both based
on what are called error-components models. The idea is to specify the error
term $u_{it}$ in (7.82) as consisting of two or three separate shocks, each of which
is assumed to be independent of the others. A fairly general specification is


$$u_{it} = e_t + v_i + \varepsilon_{it}. \tag{7.83}$$


Here $e_t$ affects all observations for time period t, $v_i$ affects all observations
for cross-sectional unit i, and $\varepsilon_{it}$ affects only observation it. It is generally
assumed that the $e_t$ are independent across t, the $v_i$ are independent
across i, and the $\varepsilon_{it}$ are independent across all i and t. Classic papers on
error-components models include Balestra and Nerlove (1966), Fuller and Battese
(1974), and Mundlak (1978).


In order to estimate an error-components model, the $e_t$ and $v_i$ can be regarded
as being either fixed or random, in a sense that we will explain. If the $e_t$
and $v_i$ are thought of as fixed effects, then they are treated as parameters
to be estimated. It turns out that they can then be estimated by OLS using
dummy variables. If they are thought of as random effects, then we must
figure out the covariance matrix of the $u_{it}$ as functions of the variances of
the $e_t$, $v_i$, and $\varepsilon_{it}$, and use feasible GLS. Each of these approaches can be
appropriate in some circumstances but may be inappropriate in others.


In what follows, we simplify the error-components specification (7.83) by
eliminating the $e_t$. Thus we assume that there are shocks specific to each
cross-sectional unit, or group, but no time-specific shocks. This assumption is often
made in empirical work, and it considerably simplifies the algebra. In addition,
we assume that the $X_{it}$ are exogenous. The presence of lagged dependent
variables in panel data models raises a number of issues that we do not wish
to discuss here; see Arellano and Bond (1991) and Arellano and Bover (1995).


Fixed-Effects Estimation 


The model that underlies fixed-effects estimation, based on equation (7.82) 
and the simplified version of equation (7.83), can be written as follows: 


$$y = X\beta + D\eta + \varepsilon, \qquad E(\varepsilon\varepsilon^\top) = \sigma_\varepsilon^2 I_n, \tag{7.84}$$


where y and $\varepsilon$ are n-vectors with typical elements $y_{it}$ and $\varepsilon_{it}$, respectively,
and D is an n × m matrix of dummy variables, constructed in such a way
that the element in the row corresponding to observation it, for $i = 1, \ldots, m$
and $t = 1, \ldots, T$, and column j, for $j = 1, \ldots, m$, is equal to 1 if $i = j$
and equal to 0 otherwise.³ The m-vector $\eta$ has typical element $v_i$, and so
it follows that the n-vector $D\eta$ has element $v_i$ in the row corresponding to
observation it. Note that there is exactly one element of D equal to 1 in each
row, which implies that the n-vector $\iota$ with each element equal to 1 is a linear
combination of the columns of D. Consequently, in order to avoid collinear
regressors, the matrix X should not contain a constant.


The vector $\eta$ plays the role of a parameter vector, and it is in this sense that
the $v_i$ are called fixed effects. They could in fact be random; the essential thing
is that they must be independent of the error terms $\varepsilon_{it}$. They may, however,
be correlated with the explanatory variables in the matrix X. Whether or
not this is the case, the model (7.84), interpreted conditionally on $\eta$, implies
that the moment conditions


$$E\bigl(X_{it}^\top(y_{it} - X_{it}\beta - v_i)\bigr) = 0 \quad\text{and}\quad E(y_{it} - X_{it}\beta - v_i) = 0$$


are satisfied. The fixed-effects estimator, which is the OLS estimator of $\beta$
in equation (7.84), is based on these moment conditions. Because of the way
it is computed, this estimator is sometimes called the least squares dummy
variables, or LSDV, estimator.


³ If the data are ordered so that all the observations in the first group appear
first, followed by all the observations in the second group, and so on, the row
corresponding to observation it will be row $T(i - 1) + t$.


Let $M_D$ denote the projection matrix $I - D(D^\top D)^{-1}D^\top$. Then, by the FWL
Theorem, we know that the OLS estimator of $\beta$ in (7.84) can be obtained
by regressing $M_D y$, the residuals from a regression of y on D, on $M_D X$,
the matrix of residuals from regressing each of the columns of X on D. The
fixed-effects estimator is therefore


$$\hat\beta_{\rm FE} = (X^\top M_D X)^{-1} X^\top M_D y. \tag{7.85}$$


For any n-vector x, let $\bar x_i$ denote the group mean $T^{-1}\sum_{t=1}^{T} x_{it}$. Then it
is easy to check that element it of the vector $M_D x$ is equal to $x_{it} - \bar x_i$,
the deviation from the group mean. Since all the variables in (7.85) are
premultiplied by $M_D$, it follows that this estimator makes use only of the
information in the variation around the mean for each of the m groups. For
this reason, it is often called the within-groups estimator. Because X and D
are exogenous, this estimator is unbiased. Moreover, since the conditions of
the Gauss-Markov theorem are satisfied, we can conclude that the fixed-effects
estimator is BLUE.
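The equivalence between the dummy-variable regression and the regression in deviations from group means can be confirmed numerically. The sketch below assumes a balanced panel ordered in m blocks of T rows; the simulated data and the helper `demean` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
m, T, k = 30, 5, 2
n = m * T
group = np.repeat(np.arange(m), T)       # observation (i, t) is row T*i + t, counting from 0

v = rng.standard_normal(m)                              # group effects
X = rng.standard_normal((n, k)) + v[group, None]        # regressors correlated with the effects
beta = np.array([1.0, -0.5])
y = X @ beta + v[group] + rng.standard_normal(n)

# LSDV: OLS of y on X and the n x m matrix of group dummies D.
D = np.kron(np.eye(m), np.ones((T, 1)))
coef = np.linalg.lstsq(np.column_stack([X, D]), y, rcond=None)[0]
beta_lsdv, eta_hat = coef[:k], coef[k:]

def demean(a):
    """Subtract group means, assuming rows are ordered in m blocks of T."""
    blocks = a.reshape(m, T, -1)
    return (blocks - blocks.mean(axis=1, keepdims=True)).reshape(a.shape)

# Within-groups: OLS on deviations from group means; D is never formed.
beta_within = np.linalg.lstsq(demean(X), demean(y), rcond=None)[0]

print(beta_lsdv, beta_within)
```

The second computation needs only the m group means of each variable, which is why the estimator remains cheap when m is very large.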


The fixed-effects estimator (7.85) has advantages and disadvantages. It is
easy to compute, even when m is very large, because it is never necessary to
make direct use of the n × n matrix $M_D$. All that is needed is to compute
the m group means for each variable. In addition, the estimates $\hat\eta$ of the fixed
effects may well be of interest in their own right. However, the estimator
cannot be used with an explanatory variable that takes on the same value for
all the observations in each group, because such a column would be collinear
with the columns of D. More generally, if the explanatory variables in the
matrix X are well explained by the dummy variables in D, the parameter
vector $\beta$ will not be estimated at all precisely. It is of course possible to
estimate a constant, simply by taking the mean of the estimates $\hat\eta$.


Random-Effects Estimation 


It is possible to improve on the efficiency of the fixed-effects estimator if one
is willing to impose restrictions on the model (7.84). For that model, all we
require is that the matrix X of explanatory variables and the cross-sectional
errors $v_i$ should both be independent of the $\varepsilon_{it}$, but this does not rule out
the possibility of a correlation between them. The restrictions imposed for
random-effects estimation require that the $v_i$ should be independent of X.


This independence assumption is by no means always plausible. For example,
in a panel of observations on individual workers, an observed variable like
the hourly wage rate may well be correlated with an unobserved variable
like ability, which implicitly enters into the individual-specific error term $v_i$.
However, if the assumption is satisfied, it follows that


$$E(u_{it} \mid X) = E(v_i + \varepsilon_{it} \mid X) = 0, \tag{7.86}$$


since $v_i$ and $\varepsilon_{it}$ are then both independent of X. Condition (7.86) is precisely
the condition which ensures that OLS estimation of the model (7.82), rather
than the model (7.84), will yield unbiased estimates.


However, OLS estimation of equation (7.82) is not in general efficient, because
the $u_{it}$ are not IID. We can calculate the covariance matrix of the $u_{it}$ if we
assume that the $v_i$ are IID random variables with mean zero and variance $\sigma_v^2$.
This assumption accounts for the term “random” effects. From (7.83), setting
$e_t = 0$ and using the assumption that the shocks are independent, it is easy
to see that


$$\mathrm{Var}(u_{it}) = \sigma_v^2 + \sigma_\varepsilon^2, \qquad \mathrm{Cov}(u_{it}, u_{is}) = \sigma_v^2, \quad\text{and}\quad \mathrm{Cov}(u_{it}, u_{js}) = 0 \ \text{ for all } i \neq j.$$


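These define the elements of the n × n covariance matrix $\Omega$, which we need
for GLS estimation. If the data are ordered by the cross-sectional units in
m blocks of T observations each, this matrix has the form


$$\Omega = \begin{bmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{bmatrix},$$


where


$$\Sigma \equiv \sigma_\varepsilon^2 I_T + \sigma_v^2 \iota\iota^\top \tag{7.87}$$


is the T × T matrix with $\sigma_v^2 + \sigma_\varepsilon^2$ in every position on the principal diagonal
and $\sigma_v^2$ everywhere else. Here $\iota$ is a T-vector of 1s.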
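The block-diagonal structure is conveniently built with a Kronecker product. In the sketch below, the function names and the numerical values of the two variances are assumptions chosen for illustration.

```python
import numpy as np

def sigma_block(T, sigma_eps2, sigma_v2):
    """The T x T block (7.87): sigma_eps^2 * I_T + sigma_v^2 * iota iota'."""
    iota = np.ones((T, 1))
    return sigma_eps2 * np.eye(T) + sigma_v2 * (iota @ iota.T)

def omega(m, T, sigma_eps2, sigma_v2):
    """Covariance matrix for m groups of T observations, ordered by group."""
    return np.kron(np.eye(m), sigma_block(T, sigma_eps2, sigma_v2))

Omega = omega(m=3, T=4, sigma_eps2=1.0, sigma_v2=0.5)
print(Omega[:5, :5])
```

In the printed corner, the first four rows and columns show the common within-group covariance, and the fifth row and column, which belong to the next group, show zeros off the diagonal.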


To obtain GLS estimates of $\beta$, we would need to know the values of $\sigma_\varepsilon^2$ and $\sigma_v^2$,
or, at least, the value of their ratio, since, as we saw in Section 7.3, GLS
estimation requires only that $\Omega$ should be specified up to a factor. To obtain
feasible GLS estimates, we need a consistent estimate of that ratio. However,
the reader may have noticed that we have made no use in this section so far
of asymptotic concepts, such as that of a consistent estimate. This is because,
in order to obtain definite results, we must specify what happens to both m
and T when n = mT tends to infinity.


Consider the fixed-effects model (7.84). If m remains fixed as $T \to \infty$, then the
number of regressors also remains fixed as $n \to \infty$, and standard asymptotic
theory applies. But if T remains fixed as $m \to \infty$, then the number of
parameters to be estimated tends to infinity, and the m-vector $\hat\eta$ of estimates
of the fixed effects is not consistent, because each estimated effect depends
only on T observations. It is nevertheless possible to show that, even in this
case, $\hat\beta$ remains consistent; see Exercise 7.23.


It is always possible to find a consistent estimate of $\sigma_\varepsilon^2$ by estimating the
model (7.84), because, no matter how m and T may behave as $n \to \infty$, there
are n residuals. Thus, if we divide the SSR from (7.84) by $n - m - k$, we will
obtain an unbiased and consistent estimate of $\sigma_\varepsilon^2$, since the error terms for this
model are just the $\varepsilon_{it}$. But the natural estimator of $\sigma_v^2$, namely, the sample
variance of the m elements of $\hat\eta$, is not consistent unless $m \to \infty$. In practice,
therefore, it is probably undesirable to use the random-effects estimator when
m is small.


There is another way to estimate $\sigma_v^2$ consistently if $m \to \infty$ as $n \to \infty$. One
starts by running the regression


$$P_D y = P_D X\beta + \text{residuals}, \tag{7.88}$$


where $P_D \equiv I - M_D$, so as to obtain the between-groups estimator


$$\hat\beta_{\rm BG} = (X^\top P_D X)^{-1} X^\top P_D y. \tag{7.89}$$


Although regression (7.88) appears to have n = mT observations, it really has 
only m, because the regressand and all the regressors are the same for every 
observation in each group. The estimator bears the name “between-groups” 
because it uses only the variation among the group means. If m < k, note 
that the estimator (7.89) does not even exist, since the matrix $X^\top P_D X$ can
have rank at most m.


If the restrictions of the random-effects model are not satisfied, the estimator
$\hat\beta_{\rm BG}$, if it exists, is in general biased and inconsistent. To see this, observe
that unbiasedness and consistency require that the moment conditions


$$E\bigl((P_D X)_{it}^\top (y_{it} - X_{it}\beta)\bigr) = 0 \tag{7.90}$$


should hold, where $(P_D X)_{it}$ is the row labelled it of the n × k matrix $P_D X$.
Since $y_{it} - X_{it}\beta = v_i + \varepsilon_{it}$, and since $\varepsilon_{it}$ is independent of everything else
in condition (7.90), this condition is equivalent to the absence of correlation
between the $v_i$ and the elements of the matrix X.


As readers are asked to show in Exercise 7.24, the variance of the error terms
in regression (7.88) is $\sigma_v^2 + \sigma_\varepsilon^2/T$. Therefore, if we run it as a regression
with m observations, divide the SSR by $m - k$, and then subtract 1/T times
our estimate of $\sigma_\varepsilon^2$, we will obtain a consistent, but not necessarily positive,
estimate of $\sigma_v^2$. If the estimate turns out to be negative, we probably should
not be estimating an error-components model.
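The two variance estimates described above can be computed from a within-groups regression and a between-groups regression on the m group means. The following sketch assumes a balanced panel ordered by group, regressors without a constant, and simulated data with known variances; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
m, T, k = 400, 5, 2
n = m * T
sigma_eps2, sigma_v2 = 1.0, 0.5                     # true values used to simulate
group = np.repeat(np.arange(m), T)

X = rng.standard_normal((n, k))
v = np.sqrt(sigma_v2) * rng.standard_normal(m)
y = X @ np.array([1.0, -1.0]) + v[group] + np.sqrt(sigma_eps2) * rng.standard_normal(n)

def group_means(a):
    """The m group means of a, assuming rows are ordered in m blocks of T."""
    return a.reshape(m, T, -1).mean(axis=1)

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

# Within-groups regression: SSR / (n - m - k) estimates sigma_eps^2.
Xw = X - np.repeat(group_means(X), T, axis=0)
yw = y - np.repeat(group_means(y), T, axis=0).ravel()
sigma_eps2_hat = ssr(Xw, yw) / (n - m - k)

# Between-groups regression run on m observations: SSR / (m - k) estimates
# sigma_v^2 + sigma_eps^2 / T, so subtract sigma_eps2_hat / T.
Xb, yb = group_means(X), group_means(y).ravel()
sigma_v2_hat = ssr(Xb, yb) / (m - k) - sigma_eps2_hat / T

print(f"sigma_eps^2: {sigma_eps2_hat:.3f}, sigma_v^2: {sigma_v2_hat:.3f}")
```

Nothing in the second formula forces the result to be positive, which is why a negative value is a warning sign about the specification.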


As we will see in the next paragraph, both the OLS estimator of model (7.82)
and the feasible GLS estimator of the random-effects model are matrix-weighted
averages of the within-groups, or fixed-effects, estimator (7.85) and
the between-groups estimator (7.89). For the former to be consistent, we need
only the assumptions of the fixed-effects model, but for the latter we need in
addition the restrictions of the random-effects model. Thus both the OLS
estimator of (7.82) and the feasible GLS estimator are consistent only if the
between-groups estimator is consistent.


For the OLS estimator of (7.82),


$$\begin{aligned}
\hat\beta &= (X^\top X)^{-1}X^\top y \\
&= (X^\top X)^{-1}(X^\top M_D y + X^\top P_D y) \\
&= (X^\top X)^{-1}X^\top M_D X\,\hat\beta_{\rm FE} + (X^\top X)^{-1}X^\top P_D X\,\hat\beta_{\rm BG},
\end{aligned}$$


which shows that the estimator is indeed a matrix-weighted average of $\hat\beta_{\rm FE}$
and $\hat\beta_{\rm BG}$. As readers are asked to show in Exercise 7.25, the GLS estimator
of the random-effects model can be obtained by running the OLS regression


$$(I - \lambda P_D)y = (I - \lambda P_D)X\beta + \text{residuals}, \tag{7.91}$$


where the scalar $\lambda$ is defined by


$$\lambda \equiv 1 - \left(\frac{T\sigma_v^2}{\sigma_\varepsilon^2} + 1\right)^{-1/2}. \tag{7.92}$$


For feasible GLS, we need to replace $\sigma_\varepsilon^2$ and $\sigma_v^2$ by the consistent estimators
that were discussed earlier in this subsection.
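That regression (7.91) reproduces GLS can be checked numerically. The sketch below treats the two variances as known, builds the full covariance matrix for a small balanced panel, and compares direct GLS with OLS on the transformed data; the data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
m, T, k = 20, 4, 2
n = m * T
sigma_eps2, sigma_v2 = 1.0, 0.5
group = np.repeat(np.arange(m), T)

X = rng.standard_normal((n, k))
v = np.sqrt(sigma_v2) * rng.standard_normal(m)
y = X @ np.array([1.0, 2.0]) + v[group] + np.sqrt(sigma_eps2) * rng.standard_normal(n)

# Direct GLS with the full n x n covariance matrix.
Sigma = sigma_eps2 * np.eye(T) + sigma_v2 * np.ones((T, T))
Omega_inv = np.kron(np.eye(m), np.linalg.inv(Sigma))
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# OLS on data transformed by I - lambda * P_D, as in (7.91) and (7.92).
lam = 1.0 - (T * sigma_v2 / sigma_eps2 + 1.0) ** -0.5
P_D = np.kron(np.eye(m), np.ones((T, T)) / T)     # replaces each element by its group mean
A = np.eye(n) - lam * P_D
beta_qd = np.linalg.lstsq(A @ X, A @ y, rcond=None)[0]

print(beta_gls, beta_qd)
```

In practice one would subtract the fraction lambda of each group mean directly rather than form the n × n matrices, which are used here only to make the comparison explicit.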


Equation (7.91) implies that the random-effects GLS estimator is a matrix-weighted
average of the OLS estimator for equation (7.82) and the between-groups
estimator, and thus also of $\hat\beta_{\rm FE}$ and $\hat\beta_{\rm BG}$. The GLS estimator is
identical to the OLS estimator when $\lambda = 0$, which happens when $\sigma_v^2 = 0$,
and equal to the within-groups, or fixed-effects, estimator when $\lambda = 1$, which
happens when $\sigma_\varepsilon^2 = 0$. Except in these two special cases, the GLS estimator
is more efficient, in the context of the random-effects model, than either the
OLS estimator or the fixed-effects estimator. But equation (7.91) also implies
that the random-effects estimator is inconsistent whenever the between-groups
estimator is inconsistent.


Unbalanced Panels 


Up to this point, we have assumed that we are dealing with a balanced panel, 
that is, a data set for which there are precisely T observations for each cross- 
sectional unit. However, it is quite common to encounter unbalanced panels, 
for which the number of observations is not the same for every cross-sectional 
unit. The fixed-effects estimator can be used with unbalanced panels without 
any real change. It is still based on regression (7.84), and the only change is 
that the matrix of dummy variables D will no longer have the same number 
of 1s in each column. The random-effects estimator can also be used with 
unbalanced panels, but it needs to be modified slightly.


Let us assume that the data are grouped by cross-sectional units. Let $T_i$
denote the number of observations associated with unit i, and partition y and
X as follows:


$$y = [y_1 \,\vdots\, y_2 \,\vdots\, \cdots \,\vdots\, y_m], \qquad X = [X_1 \,\vdots\, X_2 \,\vdots\, \cdots \,\vdots\, X_m],$$


where $y_i$ and $X_i$ denote the $T_i$ rows of y and X that correspond to the $i^{\rm th}$
unit. By analogy with (7.92), make the definition


$$\lambda_i \equiv 1 - \left(\frac{T_i\sigma_v^2}{\sigma_\varepsilon^2} + 1\right)^{-1/2}.$$

Let $\bar y_i$ denote a $T_i$-vector, each element of which is the mean of the elements
of $y_i$. Similarly, let $\bar X_i$ denote a $T_i \times k$ matrix, each element of which is the
mean of the corresponding column of $X_i$. Then the random-effects estimator
can be computed by running the linear regression


$$\begin{bmatrix} y_1 - \lambda_1\bar y_1 \\ \vdots \\ y_m - \lambda_m\bar y_m \end{bmatrix} = \begin{bmatrix} X_1 - \lambda_1\bar X_1 \\ \vdots \\ X_m - \lambda_m\bar X_m \end{bmatrix}\beta + \text{residuals}. \tag{7.93}$$


Note that $P_D y$ is just $[\bar y_1 \,\vdots\, \bar y_2 \,\vdots\, \cdots \,\vdots\, \bar y_m]$, and similarly for $P_D X$. Therefore,
since all the $\lambda_i$ are equal to $\lambda$ when the panel is balanced, regression (7.93)
reduces to regression (7.91) in that special case.
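Regression (7.93) is straightforward to code with a loop over groups. The sketch below assumes that each group-specific weight is defined as in (7.92) with T replaced by the number of observations in that group, and it treats the two variances as known; the function name `re_unbalanced` and the simulated data are illustrative. The result is compared with direct GLS as a check.

```python
import numpy as np

def re_unbalanced(y, X, group, sigma_eps2, sigma_v2):
    """Random-effects GLS via (7.93): subtract lambda_i times the group mean."""
    y_t = np.array(y, dtype=float)       # copies, so y and X are left unchanged
    X_t = np.array(X, dtype=float)
    for g in np.unique(group):
        rows = group == g
        T_i = rows.sum()
        lam_i = 1.0 - (T_i * sigma_v2 / sigma_eps2 + 1.0) ** -0.5   # assumed form
        y_t[rows] -= lam_i * y[rows].mean()
        X_t[rows] -= lam_i * X[rows].mean(axis=0)
    return np.linalg.lstsq(X_t, y_t, rcond=None)[0]

# Hypothetical unbalanced panel with a constant, a group-level regressor w,
# and an individual-level regressor.
rng = np.random.default_rng(10)
sizes = rng.integers(2, 8, size=12)
group = np.repeat(np.arange(len(sizes)), sizes)
n = len(group)
sigma_eps2, sigma_v2 = 1.0, 0.5
w = rng.standard_normal(len(sizes))[group]
X = np.column_stack([np.ones(n), w, rng.standard_normal(n)])
v = np.sqrt(sigma_v2) * rng.standard_normal(len(sizes))
y = X @ np.array([1.0, 0.5, -1.0]) + v[group] + rng.standard_normal(n)

beta_re = re_unbalanced(y, X, group, sigma_eps2, sigma_v2)

# Check against direct GLS with the block-diagonal covariance matrix.
same_group = group[:, None] == group[None, :]
Omega = sigma_eps2 * np.eye(n) + sigma_v2 * same_group
beta_gls = np.linalg.solve(X.T @ np.linalg.solve(Omega, X),
                           X.T @ np.linalg.solve(Omega, y))
print(beta_re, beta_gls)
```

Note that the coefficient on the group-level regressor w is estimated here, which a regression on group dummies could not do.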


Group Effects and Individual Data 


Error-components models are also relevant for regressions on cross-section
data with no time dimension, but where the observations naturally belong to
groups. For example, each observation might correspond to a household living
in a certain state, and each group would then consist of all the households
living in a particular state. In such cases, it is plausible that the error terms for
individuals within the same group are correlated. An error-components model
that combines a group-specific error $v_i$, with variance $\sigma_v^2$, and an
individual-specific error $\varepsilon_{it}$, with variance $\sigma_\varepsilon^2$, is a natural way to model this sort of
correlation. Such a model implies that the correlation between the error terms
for observations in the same group is $\rho = \sigma_v^2/(\sigma_v^2 + \sigma_\varepsilon^2)$ and the correlation
between the error terms for observations in different groups is zero.


A fixed-effects model is often unsatisfactory for dealing with group effects. In
many cases, some explanatory variables are observed only at the group level,
so that they have no within-group variation. Such variables are perfectly
collinear with the group dummies used in estimating a fixed-effects model,
making it impossible to identify the parameters associated with them. On the
other hand, they are identified by a random-effects model for an unbalanced
panel, because this model takes account of between-group variation. This
can be seen from equation (7.93): Collinearity of the transformed group-level
variables on the right-hand side occurs only if the explanatory variables are
collinear to begin with. The estimates of $\sigma_\varepsilon^2$ and $\sigma_v^2$ needed to compute the
$\lambda_i$ may be obtained in various ways, some of which were discussed in the
subsection on random-effects estimation. As we remarked there, these work
well only if the number of groups m is not too small.


If it is thought that the within-group correlation $\rho$ is small, it may be tempting
to ignore it and use OLS estimation, with the usual OLS covariance matrix.
This can be a serious mistake unless $\rho$ is actually zero, since the OLS standard
errors can be drastic underestimates even with small values of $\rho$, as
Kloek (1981) and Moulton (1986, 1990) have pointed out. The problem is
particularly severe when the number of observations per group is large, as
readers are asked to show in Exercise 7.26. The correlation of the error terms
within groups means that the effective sample size is much smaller than the
actual sample size when there are many observations per group.
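The size of the problem can be illustrated without any simulation by comparing the true covariance matrix of the OLS estimator with the one that the usual formula reports. The sketch below assumes equal group sizes, a regressor observed only at the group level, and error terms with unit variance; these choices and the variable names are assumptions.

```python
import numpy as np

m, T = 20, 50                 # 20 groups of 50 observations each
rho = 0.05                    # small within-group error correlation
n = m * T
group = np.repeat(np.arange(m), T)

rng = np.random.default_rng(11)
w = rng.standard_normal(m)[group]            # regressor that varies only across groups
X = np.column_stack([np.ones(n), w])

# Error covariance matrix with unit variance and within-group correlation rho.
same_group = group[:, None] == group[None, :]
Omega = (1.0 - rho) * np.eye(n) + rho * same_group

XtX_inv = np.linalg.inv(X.T @ X)
V_true = XtX_inv @ X.T @ Omega @ X @ XtX_inv     # actual covariance of the OLS estimator
V_usual = XtX_inv                                # what the usual formula uses (sigma^2 = 1)

se_ratio = np.sqrt(np.diag(V_true) / np.diag(V_usual))
print("true s.e. / usual s.e.:", se_ratio)
```

For this particular design, the squared ratio works out to 1 + (T − 1)ρ for both coefficients, so the understatement grows with the number of observations per group even when the correlation is small.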


In this section, we have presented just a few of the most basic ideas concerning 
estimation with panel data. Of course, GLS is not the only method that can 
be used to estimate models for data of this type. The generalized method of 
moments (Chapter 9) and the method of maximum likelihood (Chapter 10) 
are also commonly used. For more detailed treatments of various models 
for panel data, see, among others, Chamberlain (1984), Hsiao (1986, 2001), 
Baltagi (1995), Greene (2000, Chapter 14), Ruud (2000, Chapter 24), Arellano 
and Honoré (2001), and Wooldridge (2001). 


7.11 Final Remarks 


Several important concepts were introduced in the first four sections of this 
chapter, which dealt with the basic theory of generalized least squares esti- 
mation. The concept of an efficient MM estimator, which we introduced in 
Section 7.2, will be encountered again in the context of generalized instru- 
mental variables estimation (Chapter 8) and generalized method of moments 
estimation (Chapter 9). The key idea of feasible GLS estimation, namely, that 
an unknown covariance matrix may in some circumstances be replaced by a 
consistent estimate of that matrix without changing the asymptotic properties 
of the resulting estimator, will also be encountered again in Chapter 9. 


The remainder of the chapter dealt with the treatment of heteroskedasticity 
and serial correlation in linear regression models, and with error-components 
models for panel data. Although this material is of considerable practical 
importance, most of the techniques we discussed, although sometimes compli- 
cated in detail, are conceptually straightforward applications of feasible GLS 
estimation, NLS estimation, and methods for testing hypotheses that were 
introduced in Chapters 4 and 6.


7.12 Exercises


7.1 Using the fact that $E(uu^\top \mid X) = \Omega$ for regression (7.01), show directly,
without appeal to standard OLS results, that the covariance matrix of the
GLS estimator $\hat\beta_{\rm GLS}$ is given by (7.05).


7.2 Show that the matrix (7.11), reproduced here for easy reference,

$$X^\top \Omega^{-1} X - X^\top W (W^\top \Omega W)^{-1} W^\top X,$$

is positive semidefinite. As in Section 6.2, this may be done by showing that
this matrix can be expressed in the form $Z^\top M Z$, for some $n \times k$ matrix $Z$
and some $n \times n$ orthogonal projection matrix $M$. It is helpful to express
$\Omega^{-1}$ as $\Psi\Psi^\top$, as in (7.02).


7.3 Using the data in the file earnings.data, run the regression

$$y_t = \beta_1 d_{1t} + \beta_2 d_{2t} + \beta_3 d_{3t} + u_t,$$

which was previously estimated in Exercise 5.3. Recall that the $d_{it}$ are dummy
variables. Then test the null hypothesis that $E(u_t^2) = \sigma^2$ against the
alternative that

$$E(u_t^2) = \gamma_1 d_{1t} + \gamma_2 d_{2t} + \gamma_3 d_{3t}.$$

Report P values for F and $nR^2$ tests.


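7.4 If $u_t$ follows the stationary AR(1) process

$$u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2), \quad |\rho| < 1,$$

show that $\mathrm{Cov}(u_t, u_{t-j}) = \mathrm{Cov}(u_t, u_{t+j}) = \rho^j \sigma_\varepsilon^2/(1-\rho^2)$. Then use this
result to show that the correlation between $u_t$ and $u_{t-j}$ is just $\rho^j$.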


7.5 Consider the nonlinear regression model $y_t = x_t(\beta) + u_t$. Derive the GNR for
testing the null hypothesis that the $u_t$ are serially uncorrelated against the
alternative that they follow an AR(1) process.


7.6 Show how to test the null hypothesis that the error terms of the linear
regression model $y = X\beta + u$ are serially uncorrelated against the alternative
that they follow an AR(4) process by means of a GNR. Derive the test GNR from
first principles.


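7.7 Consider the following three models, where $u_t$ is assumed to be $\mathrm{IID}(0, \sigma^2)$:

$$\begin{aligned}
H_0:&\quad y_t = \beta + u_t \\
H_1:&\quad y_t = \beta + \rho(y_{t-1} - \beta) + u_t \\
H_2:&\quad y_t = \beta + u_t + \alpha u_{t-1}
\end{aligned}$$

Explain how to test $H_0$ against $H_1$ by using a GNR. Then show that exactly
the same test statistic is also appropriate for testing $H_0$ against $H_2$.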


7.8 Write the trace in (7.50) explicitly in terms of $P_X$ rather than $M_X$, and show
that the terms containing one or more factors of $P_X$ all vanish asymptotically.


7.9 By direct matrix multiplication, show that, if $\Psi$ is given by (7.59), then
$\Psi\Psi^\top$ is equal to the matrix

$$\begin{bmatrix}
1 & -\rho & 0 & \cdots & 0 & 0 \\
-\rho & 1+\rho^2 & -\rho & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1+\rho^2 & -\rho \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}.$$

Show further, by direct calculation, that this matrix is proportional to the
inverse of the matrix $\Omega$ given in (7.32).


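7.10 Show that equation (7.30), relating $u$ to $\varepsilon$, can be modified to take account
of the definition (7.58) of $\varepsilon_1$, with the result that

$$u_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots + \frac{\rho^{t-1}}{(1-\rho^2)^{1/2}}\,\varepsilon_1. \tag{7.94}$$

The relation $\Psi^\top u = \varepsilon$ implies that $u = (\Psi^\top)^{-1}\varepsilon$. Use the result (7.94) to
show that $\Psi^{-1}$ can be written as

$$\begin{bmatrix}
\theta & \rho\theta & \rho^2\theta & \cdots & \rho^{n-1}\theta \\
0 & 1 & \rho & \cdots & \rho^{n-2} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix},$$

where $\theta \equiv (1-\rho^2)^{-1/2}$. Verify by direct calculation that this matrix is the
inverse of the $\Psi$ given by (7.59).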


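7.11 Consider a square, symmetric, nonsingular matrix partitioned as follows:

$$H \equiv \begin{bmatrix} A & C^\top \\ C & B \end{bmatrix}, \tag{7.95}$$

where $A$ and $B$ are also square symmetric nonsingular matrices. By using the
rules for multiplying partitioned matrices (see Section 1.4), show that $H^{-1}$
can be expressed in partitioned form as

$$H^{-1} = \begin{bmatrix} D & E^\top \\ E & F \end{bmatrix},$$

where

$$\begin{aligned}
D &= (A - C^\top B^{-1} C)^{-1}, \\
E &= -B^{-1} C (A - C^\top B^{-1} C)^{-1} = -(B - C A^{-1} C^\top)^{-1} C A^{-1}, \\
F &= (B - C A^{-1} C^\top)^{-1}.
\end{aligned}$$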


7.12 Suppose that the matrix $H$ of the previous question is positive definite. It
therefore follows (see Section 3.4) that there exists a square matrix $X$ such
that $H = X^\top X$. Partition $X$ as $[X_1 \;\; X_2]$, so that

$$X^\top X = \begin{bmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix},$$

where the blocks of the matrix on the right-hand side are the same as the
blocks in (7.95). Show that the top left block $D$ of $H^{-1}$ can be expressed
as $(X_1^\top M_2 X_1)^{-1}$, where $M_2 = I - X_2 (X_2^\top X_2)^{-1} X_2^\top$. Use this result to
show that $D - A^{-1} = (X_1^\top M_2 X_1)^{-1} - (X_1^\top X_1)^{-1}$ is a positive semidefinite
matrix.


7.13 Consider testing for first-order serial correlation of the error terms in the
regression model

$$y = \beta y_1 + u, \quad |\beta| < 1, \tag{7.96}$$

where $y_1$ is the vector with typical element $y_{t-1}$, by use of the statistics
$t_{\rm GNR}$ and $t_{\rm SR}$ defined in (7.51) and (7.52), respectively. Show first that the
vector denoted as $M_X \tilde u_1$ in (7.51) and (7.52) is equal to $-\tilde\beta M_X y_2$, where
$y_2$ is the vector with typical element $y_{t-2}$, and $\tilde\beta$ is the OLS estimate of $\beta$
from (7.96). Then show that, as $n \to \infty$, $t_{\rm GNR}$ tends to the random variable
$\tau \equiv \sigma_u^{-2} \operatorname{plim} n^{-1/2} (\beta y_1 - y_2)^\top u$, whereas $t_{\rm SR}$ tends to the same random
variable times $\beta$. Show finally that $t_{\rm GNR}$, but not $t_{\rm SR}$, provides an
asymptotically correct test, by showing that the random variable $\tau$ is
asymptotically distributed as $N(0, 1)$.


7.14 The file money.data contains seasonally adjusted quarterly data for the
logarithm of the real money supply, $m_t$, real GDP, $y_t$, and the 3-month Treasury
Bill rate, $r_t$, for Canada for the period 1967:1 to 1998:4. A conventional
demand for money function is

$$m_t = \beta_1 + \beta_2 r_t + \beta_3 y_t + \beta_4 m_{t-1} + u_t. \tag{7.97}$$


Estimate this model over the period 1968:1 to 1998:4, and then test it for 
AR(1) errors using two different GNRs that differ in their treatment of the 
first observation. 


7.15 Use nonlinear least squares to estimate, over the period 1968:1 to 1998:4,
the model that results if $u_t$ in (7.97) follows an AR(1) process. Then test
the common factor restrictions that are implicit in this model. Calculate an 
asymptotic P value for the test. 


7.16 Test the common factor restrictions of Exercise 7.15 again using a GNR.
Calculate both an asymptotic P value and a bootstrap P value based on at
least $B = 99$ bootstrap samples. Hint: To obtain a consistent estimate of $\rho$
for the GNR, use the fact that the coefficient of $r_{t-1}$ in the unrestricted model
(7.73) is equal to $-\rho$ times the coefficient of $r_t$.


7.17 Use nonlinear least squares to estimate, over the period 1968:1 to 1998:4,
the model that results if $u_t$ in (7.97) follows an AR(2) process. Is there any
evidence that an AR(2) process is needed here? 


7.18 The algorithm called iterated Cochrane-Orcutt, alluded to in Section 7.8, is
just iterated feasible GLS without the first observation. This algorithm is
begun by running the regression $y = X\beta + u$ by OLS, preferably omitting
observation 1, in order to obtain the first estimate of $\beta$. The residuals from this
equation are then used to estimate $\rho$ according to equation (7.69). What is the
next step in this procedure? Complete the description of iterated
Cochrane-Orcutt as iterated feasible GLS, showing how each step of the procedure
can be carried out using an OLS regression.


Show that, when the algorithm converges, conditions (7.68) for NLS esti- 
mation are satisfied. Also show that, unlike iterated feasible GLS including 
observation 1, this algorithm must eventually converge, although perhaps only 
to a local, rather than the global, minimum of $\mathrm{SSR}(\beta, \rho)$.


7.19 Consider once more the model that you estimated in Exercise 7.15. Estimate
this model using the iterated Cochrane-Orcutt algorithm, using a sequence of 
OLS regressions, and see how many iterations are needed to achieve the same 
estimates as those achieved by NLS. Compare this number with the number 
of iterations used by NLS itself. 


Repeat the exercise with a starting value of 0.5 for p instead of the value of 0 
that is conventionally used. 


7.20 Test the hypothesis that the error terms of the linear regression model (7.97)
are serially uncorrelated against the alternatives that they follow the simple
AR(4) process $u_t = \rho_4 u_{t-4} + \varepsilon_t$ and that they follow a general AR(4) process.


Test the hypothesis that the error terms of the nonlinear regression model 
you estimated in Exercise 7.15 are serially uncorrelated against the same two 
alternative hypotheses. Use Gauss-Newton regressions. 


7.21 Consider the linear regression model

$$y = X_0\beta_0 + X_1\beta_1 + X_2\beta_2 + u, \quad u \sim \mathrm{IID}(0, \sigma^2 I), \tag{7.98}$$

where there are $n$ observations, and $k_0$, $k_1$, and $k_2$ denote the numbers of
parameters in $\beta_0$, $\beta_1$, and $\beta_2$, respectively. Let $H_0$ denote the hypothesis
that $\beta_1 = 0$ and $\beta_2 = 0$, $H_1$ denote the hypothesis that $\beta_2 = 0$, and $H_2$
denote the model (7.98) with no restrictions.


Show that the F statistics for testing $H_0$ against $H_1$ and for testing $H_1$ against
$H_2$ are asymptotically independent of each other.


7.22 This question uses data on daily returns for the period 1989-1998 for shares
of Mobil Corporation from the file daily-crsp.data. These data are made 
available by courtesy of the Center for Research in Security Prices (CRSP); 
see the comments at the bottom of the file. Regress these returns on a constant 
and themselves lagged once, twice, three, and four times, dropping the first 
four observations. Then test the null hypothesis that all coefficients except 
the constant term are equal to zero, as they should be if market prices fully 
reflect all available information. Perform a heteroskedasticity-robust test by 
running two HRGNRs, and report P values for both tests. 


7.23 Consider the fixed-effects model (7.84). Show that, under mild regularity
conditions, which you should specify, the OLS estimator $\hat\beta_{\rm FE}$ tends in probability
to the true parameter vector $\beta_0$ as $m$, the number of cross-sectional units,
tends to infinity, while $T$, the number of time periods, remains fixed.


7.24 Suppose that

$$y = X\beta + v + \varepsilon, \tag{7.99}$$

where there are $n = mT$ observations, $y$ is an $n$-vector with typical element
$y_{it}$, $X$ is an $n \times k$ matrix with typical row $X_{it}$, $\varepsilon$ is an $n$-vector with typical
element $\varepsilon_{it}$, and $v$ is an $n$-vector with $v_i$ repeated in the positions that
correspond to $y_{i1}$ through $y_{iT}$. Let the $v_i$ have variance $\sigma_v^2$ and the $\varepsilon_{it}$ have
variance $\sigma_\varepsilon^2$. Given these assumptions, show that the variance of the error
terms in regression (7.88) is $\sigma_v^2 + \sigma_\varepsilon^2/T$.
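

7.25 Show that, for $\Sigma$ defined in (7.87),

$$\Sigma^{-1/2} = \frac{1}{\sigma_\varepsilon}\,(I_T - \lambda P_\iota),$$

where $P_\iota \equiv \iota(\iota^\top\iota)^{-1}\iota^\top = (1/T)\,\iota\iota^\top$, and

$$\lambda = 1 - \Bigl( \frac{T\sigma_v^2}{\sigma_\varepsilon^2} + 1 \Bigr)^{-1/2}.$$

Then use this result to show that the GLS estimates of $\beta$ may be obtained
by running regression (7.91). What is the covariance matrix of the GLS
estimator?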


7.26 Suppose that, in the error-components model (7.99), none of the columns of $X$
displays any within-group variation. Recall that, for this model, the data are 
balanced, with m groups and T observations per group. Show that the OLS 
and GLS estimators are identical in this special case. Then write down the 
true covariance matrix of both these estimators. How is this covariance matrix 
related to the usual one for OLS that would be computed by a regression 
package under classical assumptions? What happens to this relationship as 
$T$ and $\rho$, the correlation of the error terms within groups, change?


Chapter 8


Instrumental Variables Estimation


8.1 Introduction 


In Section 3.3, the ordinary least squares estimator $\hat\beta$ was shown to be
consistent under condition (3.10), according to which the expectation of the error
term $u_t$ associated with observation $t$ is zero conditional on the regressors $X_t$
for that same observation. As we saw in Section 4.5, this condition can also
be expressed either by saying that the regressors $X_t$ are predetermined or by
saying that the error terms $u_t$ are innovations. When condition (3.10) does
not hold, the consistency proof of Section 3.3 is not applicable, and the OLS
estimator will, in general, be biased and inconsistent.


It is not always reasonable to assume that the error terms are innovations. 
In fact, as we will see in the next section, there are commonly encountered 
situations in which the error terms are necessarily correlated with some of the 
regressors for the same observation. Even in these circumstances, however, it 
is usually possible, although not always easy, to define an information set $\Omega_t$
for each observation such that

$$E(u_t \mid \Omega_t) = 0. \tag{8.01}$$

Any regressor of which the value in period $t$ is correlated with $u_t$ cannot
belong to $\Omega_t$.


In Section 6.2, method of moments (MM) estimators were discussed for both 
linear and nonlinear regression models. Such estimators are defined by the 
moment conditions (6.10) in terms of a matrix W of variables, with one row 
for each observation. They were shown to be consistent provided that the $t^{\rm th}$
row $W_t$ of $W$ belongs to $\Omega_t$, and provided that an asymptotic identification
condition is satisfied. In econometrics, these MM estimators are usually called 
instrumental variables estimators, or IV estimators. Instrumental variables 
estimation is introduced in Section 8.3, and a number of important results 
are discussed. Then finite-sample properties are discussed in Section 8.4, hy- 
pothesis testing in Section 8.5, and overidentifying restrictions in Section 8.6. 
Next, Section 8.7 introduces a procedure for testing whether it is actually 
necessary to use IV estimation. Bootstrap testing is discussed in Section 8.8. 
Finally, in Section 8.9, IV estimation of nonlinear regression models is dealt 
with briefly. A more general class of MM estimators, of which both OLS and
IV are special cases, will be the subject of Chapter 9. 


8.2 Correlation Between Error Terms and Regressors 


We now briefly discuss two common situations in which the error terms will 
be correlated with the regressors and will therefore not have mean zero con- 
ditional on them. The first one, usually referred to by the name errors in 
variables, occurs whenever the independent variables in a regression model 
are measured with error. The second situation, often simply referred to as 
simultaneity, occurs whenever two or more endogenous variables are jointly 
determined by a system of simultaneous equations. 


Errors in Variables 


For a variety of reasons, many economic variables are measured with error. For 
example, macroeconomic time series are often based, in large part, on surveys, 
and they must therefore suffer from sampling variability. Whenever there 
are measurement errors, the values economists observe inevitably differ, to a 
greater or lesser extent, from the true values that economic agents presumably 
act upon. As we will see, measurement errors in the dependent variable of a 
regression model are generally of no great consequence, unless they are very 
large. However, measurement errors in the independent variables cause the 
error terms to be correlated with the regressors that are measured with error, 
and this causes OLS to be inconsistent. 


The problems caused by errors in variables can be seen quite clearly in the 
context of the simple linear regression model. Consider the model 


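$$y_t^\circ = \beta_1 + \beta_2 x_t^\circ + u_t^\circ, \quad u_t^\circ \sim \mathrm{IID}(0, \sigma^2), \tag{8.02}$$

where the variables $x_t^\circ$ and $y_t^\circ$ are not actually observed. Instead, we observe

$$\begin{aligned}
x_t &\equiv x_t^\circ + v_{1t}, \text{ and} \\
y_t &\equiv y_t^\circ + v_{2t}.
\end{aligned} \tag{8.03}$$

Here $v_{1t}$ and $v_{2t}$ are measurement errors which are assumed, perhaps not
realistically in some cases, to be IID with variances $\omega_1^2$ and $\omega_2^2$, respectively,
and to be independent of $x_t^\circ$, $y_t^\circ$, and $u_t^\circ$.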


If we suppose that the true DGP is a special case of (8.02) along with (8.03),
we see from (8.03) that $x_t^\circ = x_t - v_{1t}$ and $y_t^\circ = y_t - v_{2t}$. If we substitute these
into (8.02), we find that

$$\begin{aligned}
y_t &= \beta_1 + \beta_2 (x_t - v_{1t}) + u_t^\circ + v_{2t} \\
&= \beta_1 + \beta_2 x_t + u_t^\circ + v_{2t} - \beta_2 v_{1t} \\
&= \beta_1 + \beta_2 x_t + u_t,
\end{aligned} \tag{8.04}$$

where $u_t \equiv u_t^\circ + v_{2t} - \beta_2 v_{1t}$. Thus $\mathrm{Var}(u_t)$ is equal to $\sigma^2 + \omega_2^2 + \beta_2^2 \omega_1^2$. The
effect of the measurement error in the dependent variable is simply to increase
the variance of the error terms. Unless the increase is substantial, this is
generally not a serious problem.


The measurement error in the independent variable also increases the variance 
of the error terms, but it has another, much more severe, consequence as well. 
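Because $x_t = x_t^\circ + v_{1t}$, and $u_t$ depends on $v_{1t}$, $u_t$ will be correlated with $x_t$
whenever $\beta_2 \neq 0$. In fact, since the random part of $x_t$ is $v_{1t}$, we see that

$$E(u_t \mid x_t) = E(u_t \mid v_{1t}) = -\beta_2 v_{1t}, \tag{8.05}$$

because we assume that $v_{1t}$ is independent of $u_t^\circ$ and $v_{2t}$. From (8.05), we can
see, using the fact that $E(u_t) = 0$ unconditionally, that

$$\begin{aligned}
\mathrm{Cov}(x_t, u_t) = E(x_t u_t) &= E\bigl(x_t\, E(u_t \mid x_t)\bigr) \\
&= -E\bigl((x_t^\circ + v_{1t})\, \beta_2 v_{1t}\bigr) = -\beta_2 \omega_1^2.
\end{aligned}$$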


This covariance is negative if $\beta_2 > 0$ and positive if $\beta_2 < 0$, and, since it does
not depend on the sample size $n$, it will not go away as $n$ becomes large. An
exactly similar argument shows that the assumption that $E(u_t \mid X_t) = 0$ is
false whenever any element of $X_t$ is measured with error. In consequence, the
OLS estimator will be biased and inconsistent.
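

The covariance result above can be checked by simulation. The following is only
a sketch under assumed values: the parameters, the normal distributions, the
variance of 4 for the latent regressor, and the helper names `simulate_eiv` and
`ols_slope` are all illustrative choices that do not come from the text. With
these values, the sample covariance of $x_t$ and $u_t$ should be close to
$-\beta_2\omega_1^2 = -0.8$, and the OLS slope should settle near
$\beta_2 \cdot 4/(4 + \omega_1^2) = 0.64$ rather than near $\beta_2 = 0.8$.

```python
import numpy as np

def simulate_eiv(n, beta1=1.0, beta2=0.8, sigma=1.0, omega1=1.0, omega2=0.5, seed=0):
    """Draw one sample from (8.02)-(8.03); all parameter values are assumptions."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(0.0, 2.0, n)        # latent regressor, variance 4 (assumed)
    u_true = rng.normal(0.0, sigma, n)      # error term of (8.02)
    v1 = rng.normal(0.0, omega1, n)         # measurement error in x
    v2 = rng.normal(0.0, omega2, n)         # measurement error in y
    x = x_true + v1                         # observed regressor, as in (8.03)
    y = beta1 + beta2 * x_true + u_true + v2
    u = u_true + v2 - beta2 * v1            # composite error term of (8.04)
    return x, y, u

def ols_slope(x, y):
    """Slope coefficient from an OLS regression of y on a constant and x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

x, y, u = simulate_eiv(200_000)
cov_xu = float(np.cov(x, u)[0, 1])   # theory: -beta2 * omega1**2
slope = ols_slope(x, y)              # attenuated toward zero
print(cov_xu, slope)
```

Increasing `n` does not move the slope estimate toward 0.8, which is the
practical meaning of inconsistency here.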


Errors in variables are a potential problem whenever we try to estimate a 
consumption function, especially if we are using cross-section data. Many 
economic theories (for example, Friedman, 1957) suggest that household con- 
sumption will depend on “permanent” income or “life-cycle” income, but sur- 
veys of household behavior almost never measure this. Instead, they typically 
provide somewhat inaccurate estimates of current income. If we think of $y_t$ as
measured consumption, $x_t^\circ$ as permanent income, and $x_t$ as estimated current
income, then the above analysis applies directly to the consumption function.
The marginal propensity to consume is $\beta_2$, which must be positive, causing
the correlation between $u_t$ and $x_t$ to be negative. As readers are asked to show
in Exercise 8.1, the probability limit of $\hat\beta_2$ is less than the true value $\beta_{20}$. In
consequence, the OLS estimator $\hat\beta_2$ is biased downward, even asymptotically.


Of course, if our objective is simply to estimate the relationship between the 
observed dependent variable $y_t$ and the observed independent variable $x_t$,
there is nothing wrong with using ordinary least squares to estimate equation
(8.04). In that case, $u_t$ would simply be defined as the difference between
$y_t$ and its expectation conditional on $x_t$. But our analysis shows that the
OLS estimators of $\beta_1$ and $\beta_2$ in equation (8.04) are not consistent for the
corresponding parameters of equation (8.02). In most cases, it is parameters 
like these that we want to estimate on the basis of economic theory. 


There is an extensive literature on ways to avoid the inconsistency caused by 
errors in variables. See, among many others, Hausman and Watson (1985), 
Leamer (1987), and Dagenais and Dagenais (1997). The simplest and most
widely used approach is just to use an instrumental variables estimator.


Simultaneous Equations 


Economic theory often suggests that two or more endogenous variables are 
determined simultaneously. In this situation, as we will see shortly, all of the 
endogenous variables will necessarily be correlated with the error terms in all 
of the equations. This means that none of them may validly appear in the 
regression functions of models that are to be estimated by least squares. 


A classic example, which well illustrates the econometric problems caused by 
simultaneity, is the determination of price and quantity for a commodity at 
the partial equilibrium of a competitive market. Suppose that $q_t$ is quantity
and $p_t$ is price, both of which would often be in logarithms. A linear (or
loglinear) model of demand and supply is

$$q_t = \gamma_d\, p_t + X_t^d \beta_d + u_t^d \tag{8.06}$$

$$q_t = \gamma_s\, p_t + X_t^s \beta_s + u_t^s, \tag{8.07}$$

where equation (8.06) is the demand function and equation (8.07) is the supply
function. Here $X_t^d$ and $X_t^s$ are row vectors of observations on exogenous or
predetermined variables that appear, respectively, in the demand and supply
functions, $\beta_d$ and $\beta_s$ are corresponding vectors of parameters, $\gamma_d$ and $\gamma_s$ are
scalar parameters, and $u_t^d$ and $u_t^s$ are the error terms in the demand and
supply functions. Economic theory predicts that, in most cases, $\gamma_d < 0$ and
$\gamma_s > 0$, which is equivalent to saying that the demand curve slopes downward
and the supply curve slopes upward.


Equations (8.06) and (8.07) are a pair of linear simultaneous equations for 
the two unknowns $p_t$ and $q_t$. For that reason, these equations constitute what
is called a linear simultaneous equations model. In this case, there are two 
dependent variables, quantity and price. For estimation purposes, the key 
feature of the model is that quantity depends on price in both equations. 


Since there are two equations and two unknowns, it is straightforward to solve 
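equations (8.06) and (8.07) for $p_t$ and $q_t$. This is most easily done by rewriting
them in matrix notation as

$$\begin{bmatrix} 1 & -\gamma_d \\ 1 & -\gamma_s \end{bmatrix}
\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} X_t^d \beta_d \\ X_t^s \beta_s \end{bmatrix}
+ \begin{bmatrix} u_t^d \\ u_t^s \end{bmatrix}. \tag{8.08}$$

The solution to (8.08), which will exist whenever $\gamma_d \neq \gamma_s$, so that the matrix
on the left-hand side of (8.08) is nonsingular, is

$$\begin{bmatrix} q_t \\ p_t \end{bmatrix}
= \begin{bmatrix} 1 & -\gamma_d \\ 1 & -\gamma_s \end{bmatrix}^{-1}
\left( \begin{bmatrix} X_t^d \beta_d \\ X_t^s \beta_s \end{bmatrix}
+ \begin{bmatrix} u_t^d \\ u_t^s \end{bmatrix} \right). \tag{8.09}$$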


It can be seen from this solution that $p_t$ and $q_t$ will depend on both $u_t^d$ and $u_t^s$,
and on every exogenous and predetermined variable that appears in either the
demand function, the supply function, or both. Therefore, $p_t$, which appears
on the right-hand side of equations (8.06) and (8.07), must be correlated with
the error terms in both of those equations. If we rewrote one or both equations
so that $p_t$ was on the left-hand side and $q_t$ was on the right-hand side, the
problem would not go away, because $q_t$ is also correlated with the error terms
in both equations.
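

A small simulation makes this correlation visible. The sketch below is purely
illustrative: the slopes $\gamma_d = -1$ and $\gamma_s = 0.5$, the single exogenous shifter
in each equation, the unit coefficients, and the standard normal draws are all
assumptions, not values taken from the text. It solves the system (8.08) for
every observation at once and then computes the sample correlations between
price and the two error terms. Under these assumptions the correlations should
be close to 0.5 and $-0.5$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
gamma_d, gamma_s = -1.0, 0.5       # assumed slopes: demand down, supply up
beta_d, beta_s = 1.0, 1.0          # assumed coefficients on the shifters
x_d = rng.normal(size=n)           # hypothetical exogenous demand shifter
x_s = rng.normal(size=n)           # hypothetical exogenous supply shifter
u_d = rng.normal(size=n)           # demand error term
u_s = rng.normal(size=n)           # supply error term

# Left-hand matrix of (8.08), acting on the stacked vector (q_t, p_t)
A = np.array([[1.0, -gamma_d],
              [1.0, -gamma_s]])
rhs = np.vstack([x_d * beta_d + u_d,
                 x_s * beta_s + u_s])      # shape (2, n), one column per t
q, p = np.linalg.solve(A, rhs)             # solution (8.09) for all t

corr_p_ud = float(np.corrcoef(p, u_d)[0, 1])
corr_p_us = float(np.corrcoef(p, u_s)[0, 1])
print(corr_p_ud, corr_p_us)
```

Because price is a linear combination of both error terms, neither correlation
shrinks as `n` grows.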


It is easy to see that, whenever we have a linear simultaneous equations model, 
there will be correlation between all of the error terms and all of the endo- 
genous variables. If there are g endogenous variables and g equations, the 
solution will look very much like (8.09), with the inverse of a g x g matrix 
premultiplying the sum of a g-vector of linear combinations of the exogenous 
and predetermined variables and a $g$-vector of error terms. If we want to
estimate the full system of equations, there are many options, some of which will
be discussed in Chapter 12. If we simply want to estimate one equation out 
of such a system, the most popular approach is to use instrumental variables. 


We have discussed two important situations in which the error terms will 
necessarily be correlated with some of the regressors, and the OLS estimator 
will consequently be inconsistent. This provides a strong motivation to employ 
estimators that do not suffer from this type of inconsistency. In the remainder 
of this chapter, we therefore discuss the method of instrumental variables. 
This method can be used whenever the error terms are correlated with one 
or more of the explanatory variables, regardless of how that correlation may 
have arisen. 


8.3 Instrumental Variables Estimation 


For most of this chapter, we will focus on the linear regression model 

$$y = X\beta + u, \quad E(uu^\top) = \sigma^2 I, \tag{8.10}$$

where at least one of the explanatory variables in the $n \times k$ matrix $X$ is
assumed not to be predetermined with respect to the error terms. Suppose
that, for each $t = 1, \ldots, n$, condition (8.01) is satisfied for some suitable
information set $\Omega_t$, and that we can form an $n \times k$ matrix $W$ with typical
row $W_t$ such that all its elements belong to $\Omega_t$. The $k$ variables given by
the k columns of W are called instrumental variables, or simply instruments. 
Later, we will allow for the possibility that the number of instruments may 
exceed the number of regressors. 


Instrumental variables may be either exogenous or predetermined, and, for a 
reason that will be explained later, they should always include any columns 
of X that are exogenous or predetermined. Finding suitable instruments may 
be quite easy in some cases, but it can be extremely difficult in others. Many
empirical controversies in economics are essentially disputes about whether or 
not certain variables constitute valid instruments. 


The Simple IV Estimator


For the linear model (8.10), the moment conditions (6.10) simplify to

$$W^\top (y - X\beta) = 0. \tag{8.11}$$

Since there are $k$ equations and $k$ unknowns, we can solve equations (8.11)
directly to obtain the simple IV estimator

$$\hat\beta_{\rm IV} \equiv (W^\top X)^{-1} W^\top y. \tag{8.12}$$

This well-known estimator has a long history (see Morgan, 1990). Whenever
$W_t \in \Omega_t$,

$$E(u_t \mid W_t) = 0, \tag{8.13}$$

and $W_t$ is seen to be predetermined with respect to the error term. Given
(8.13), it was shown in Section 6.2 that $\hat\beta_{\rm IV}$ is consistent and asymptotically
normal under an identification condition. For asymptotic identification, this
condition can be written as

$$S_{W^\top X} \equiv \operatorname{plim}_{n\to\infty} \frac{1}{n} W^\top X \text{ is deterministic and nonsingular.} \tag{8.14}$$

For identification by any given sample, the condition is just that $W^\top X$ should
be nonsingular. If this condition were not satisfied, equations (8.11) would
have no unique solution.


It is easy to see directly that the simple IV estimator (8.12) is consistent, 
and, in so doing, to see that condition (8.13) can be weakened slightly. If 
the model (8.10) is correctly specified, with true parameter vector $\beta_0$, then it
follows that

$$\begin{aligned}
\hat\beta_{\rm IV} &= (W^\top X)^{-1} W^\top X \beta_0 + (W^\top X)^{-1} W^\top u \\
&= \beta_0 + (n^{-1} W^\top X)^{-1}\, n^{-1} W^\top u.
\end{aligned} \tag{8.15}$$

Given the assumption (8.14) of asymptotic identification, it is clear that $\hat\beta_{\rm IV}$
is consistent if and only if

$$\operatorname{plim}_{n\to\infty} \frac{1}{n} W^\top u = 0, \tag{8.16}$$

which is precisely the condition (6.16) that was used in the consistency proof
in Section 6.2. We usually refer to this condition by saying that the error
terms are asymptotically uncorrelated with the instruments. Condition (8.16)
follows from condition (8.13) by the law of large numbers, but it may hold
even if condition (8.13) does not. The weaker condition (8.16) is what is
required for the consistency of the IV estimator.
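

As an illustration, the simple IV estimator (8.12) takes only a line of code. The
sketch below applies it to an errors-in-variables design of the kind discussed in
Section 8.2. The function name `simple_iv`, the instrument `z`, and every
numerical value are assumptions made for this example only. The instrument is
correlated with the latent regressor but independent of the measurement error
and of the error term, so condition (8.16) holds by construction. Under these
assumptions the OLS slope should be near $0.8 \times 2/3 \approx 0.53$, while the IV slope
should be near the true value 0.8.

```python
import numpy as np

def simple_iv(W, X, y):
    """Simple IV estimator (8.12): solve W'X b = W'y, assuming W'X is nonsingular."""
    return np.linalg.solve(W.T @ X, W.T @ y)

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(size=n)                     # hypothetical instrument
x_true = z + rng.normal(size=n)            # latent regressor, correlated with z
x = x_true + rng.normal(size=n)            # observed with measurement error
y = 1.0 + 0.8 * x_true + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])       # regressors: constant and observed x
W = np.column_stack([np.ones(n), z])       # the constant serves as its own instrument

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # inconsistent for the slope
b_iv = simple_iv(W, X, y)                      # consistent under (8.16)
print(b_ols, b_iv)
```

Solving the linear system directly, rather than forming the inverse of $W^\top X$
explicitly, is a numerical design choice and does not change the estimator.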


Efficiency Considerations


If the model (8.10) is correctly specified with true parameter vector $\beta_0$ and
true error variance $\sigma_0^2$, the results of Section 6.2 show that the asymptotic
covariance matrix of $n^{1/2}(\hat\beta_{\rm IV} - \beta_0)$ is given by (6.25) or (6.26):

$$\begin{aligned}
\mathrm{Var}\Bigl( \operatorname{plim}_{n\to\infty} n^{1/2} (\hat\beta_{\rm IV} - \beta_0) \Bigr)
&= \sigma_0^2\, (S_{W^\top X})^{-1} S_{W^\top W} (S_{W^\top X}^\top)^{-1} \\
&= \sigma_0^2 \operatorname{plim}_{n\to\infty} (n^{-1} X^\top P_W X)^{-1},
\end{aligned} \tag{8.17}$$

where $S_{W^\top W} \equiv \operatorname{plim} n^{-1} W^\top W$. If we have some choice over what
instruments to use in the matrix $W$, it makes sense to choose them so as to minimize
the above asymptotic covariance matrix.


First of all, notice that, since (8.17) depends on $W$ only through the orthogonal
projection matrix $P_W$, all that matters is the space $S(W)$ spanned by the
instrumental variables. In fact, as readers are asked to show in Exercise 8.2,
the estimator $\hat\beta_{\rm IV}$ itself depends on $W$ only through $P_W$. This fact is closely
related to the result that, for ordinary least squares, fitted values and residuals
depend only on the space $S(X)$ spanned by the regressors.
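

A numerical check, which is not a proof, can make this invariance concrete. In
the sketch below, all matrices, dimensions, and parameter values are arbitrary
assumptions. The instrument matrix $W$ is replaced by $WA$ for a fixed nonsingular
matrix $A$, which changes the instruments but not the space they span, and the
estimate is also recomputed from an expression that involves $W$ only through $P_W$.
All three computations should agree up to rounding error.

```python
import numpy as np

def simple_iv(W, X, y):
    """Simple IV estimator (8.12), assuming W'X is nonsingular."""
    return np.linalg.solve(W.T @ X, W.T @ y)

rng = np.random.default_rng(3)
n, k = 500, 3
W = rng.normal(size=(n, k))                       # arbitrary instruments
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.5, 0.0, 1.0]])                   # fixed nonsingular matrix
X = W @ B + rng.normal(size=(n, k))               # regressors related to W
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])                   # nonsingular (determinant 7)

b_w = simple_iv(W, X, y)          # original instruments
b_wa = simple_iv(W @ A, X, y)     # same span S(W), different basis

# The same estimate written so that W enters only through P_W
P_W = W @ np.linalg.solve(W.T @ W, W.T)
b_pw = np.linalg.solve(X.T @ P_W @ X, X.T @ P_W @ y)
print(b_w, b_wa, b_pw)
```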


Suppose first that we are at liberty to choose for instruments any variables at 
all that satisfy the predeterminedness condition (8.13). Then, under reason- 
able and plausible conditions, we can characterize the optimal instruments 
for IV estimation of the model (8.10). By this, we mean the instruments that 
minimize the asymptotic covariance matrix (8.17), in the usual sense that any 
other choice of instruments leads to an asymptotic covariance matrix that 
differs from the optimal one by a positive semidefinite matrix. 


In order to determine the optimal instruments, we must know the data- 
generating process. In the context of a simultaneous equations model, a single 
equation like (8.10), even if we know the values of the parameters, cannot be a 
complete description of the DGP, because at least some of the variables in the 
matrix X are endogenous. For the DGP to be fully specified, we must know 
how all the endogenous variables are generated. For the demand-supply model 
given by equations (8.06) and (8.07), both of those equations are needed to 
specify the DGP. For a more complicated simultaneous equations model with 
g endogenous variables, we would need g equations. For the simple errors-in- 
variables model discussed in Section 8.2, we need equations (8.03) as well as 
equation (8.02) in order to specify the DGP fully. 


Quite generally, we can suppose that the explanatory variables in (8.10) satisfy
the relation

$$X = \bar X + V, \quad E(V_t \mid \Omega_t) = 0, \tag{8.18}$$

where the $t^{\rm th}$ row of $\bar X$ is $\bar X_t = E(X_t \mid \Omega_t)$, and $X_t$ is the $t^{\rm th}$ row of $X$. Thus
equation (8.18) can be interpreted as saying that $\bar X_t$ is the expectation of $X_t$
conditional on the information set $\Omega_t$. It turns out that the $n \times k$ matrix
$\bar X$ provides the optimal instruments for (8.10). Of course, in practice, this
matrix is never observed, and we will need to replace $\bar X$ by something that
estimates it consistently.


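To see that $\bar X$ provides the optimal matrix of instruments, it is, as usual, easier
to reason in terms of precision matrices rather than covariance matrices. For
any valid choice of instruments, the precision matrix corresponding to (8.17)
is $\sigma_0^{-2}$ times

$$\operatorname{plim}_{n\to\infty} \frac{1}{n} X^\top P_W X
= \operatorname{plim}_{n\to\infty} \bigl( n^{-1} X^\top W\, (n^{-1} W^\top W)^{-1}\, n^{-1} W^\top X \bigr). \tag{8.19}$$

Using (8.18) and a law of large numbers, we see that

$$\begin{aligned}
\operatorname{plim}_{n\to\infty} n^{-1} X^\top W &= \lim_{n\to\infty} n^{-1} E(X^\top W) \\
&= \lim_{n\to\infty} n^{-1} E(\bar X^\top W) = \operatorname{plim}_{n\to\infty} n^{-1} \bar X^\top W.
\end{aligned} \tag{8.20}$$

The second equality holds because $E(V^\top W) = O$, since, by the construction
in (8.18), $V_t$ has mean zero conditional on $W_t$. The last equality is just a LLN
in reverse. Similarly, we find that $\operatorname{plim} n^{-1} W^\top X = \operatorname{plim} n^{-1} W^\top \bar X$. Thus
(8.19) becomes

$$\operatorname{plim}_{n\to\infty} \frac{1}{n} \bar X^\top P_W \bar X. \tag{8.21}$$

If we make the choice $W = \bar X$, then (8.21) reduces to $\operatorname{plim} n^{-1} \bar X^\top \bar X$. The
difference between this and (8.21) is just $\operatorname{plim} n^{-1} \bar X^\top M_W \bar X$, which is a
positive semidefinite matrix. This shows that $\bar X$ is indeed the optimal choice of
instrumental variables by the criterion of asymptotic variance.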
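

The last step rests on a purely algebraic fact, namely that
$\bar X^\top \bar X - \bar X^\top P_W \bar X = \bar X^\top M_W \bar X$ is positive semidefinite for any
full-rank $W$. The sketch below verifies this in a finite sample. It is only a
linear-algebra illustration under assumed values: the array `Xbar` merely stands
in for the unobservable matrix of conditional expectations, and the way `W` is
generated is arbitrary and makes no claim about validity of the instruments.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 400, 2
Xbar = rng.normal(size=(n, k))            # stand-in for the rows E(X_t | Omega_t)
mix = np.array([[1.0, 0.3],
                [0.0, 1.0]])              # arbitrary mixing matrix (assumed)
W = Xbar @ mix + 0.5 * rng.normal(size=(n, k))   # some other instrument matrix

P_W = W @ np.linalg.solve(W.T @ W, W.T)   # orthogonal projection on to S(W)
M_W = np.eye(n) - P_W                     # complementary projection

prec_opt = Xbar.T @ Xbar / n              # precision term with W = Xbar
prec_w = Xbar.T @ P_W @ Xbar / n          # finite-sample analog of (8.21)
gap = prec_opt - prec_w                   # should equal Xbar' M_W Xbar / n

eigs = np.linalg.eigvalsh((gap + gap.T) / 2)   # eigenvalues of the symmetric part
print(eigs)
```

Nonnegative eigenvalues mean that no linear combination of the parameters is
estimated more precisely with `W` than with `Xbar`.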


We mentioned earlier that all the explanatory variables in (8.10) that are exo- 
genous or predetermined should be included in the matrix W of instrumental 
variables. It is now clear why this is so. If we denote by Z the submatrix 
of $X$ containing the exogenous or predetermined variables, then $\bar Z = Z$,
because the row $Z_t$ is already contained in $\Omega_t$. Thus $Z$ is a submatrix of the
matrix $\bar X$ of optimal instruments. As such, it should always be a submatrix
of the matrix of instruments $W$ used for estimation, even if $W$ is not actually
equal to $\bar X$.


The Generalized IV Estimator 


In practice, the information set $\Omega_t$ is very frequently specified by providing
a list of $l$ instrumental variables that suggest themselves for various reasons.
Therefore, we now drop the assumption that the number of instruments is
equal to the number of parameters and let $W$ denote an $n \times l$ matrix of
instruments. Often, $l$ is greater than $k$, the number of regressors in the model (8.10).
In this case, the model is said to be overidentified, because, in general, there
is more than one way to formulate moment conditions like (8.11) using the
available instruments. If $l = k$, the model (8.10) is said to be just identified
or exactly identified, because there is only one way to formulate the moment
conditions. If $l < k$, it is said to be underidentified, because there are fewer
moment conditions than parameters to be estimated, and equations (8.11)
will therefore have no unique solution.


If any instruments at all are available, it is normally possible to generate 
an arbitrarily large collection of them, because any deterministic function of 
the $l$ components of the $t^{\rm th}$ row $W_t$ of $W$ can be used as the $t^{\rm th}$ component
of a new instrument.¹ If (8.10) is underidentified, some such procedure is
necessary if we wish to obtain consistent estimates of all the elements of $\beta$.
Alternatively, we would have to impose at least $k - l$ restrictions on $\beta$ so as
to reduce the number of independent parameters that must be estimated to 
no more than the number of instruments. 


For models that are just identified or overidentified, it is often desirable to 
limit the set of potential instruments to deterministic linear functions of the 
instruments in W, rather than allowing arbitrary deterministic functions. We 
will see shortly that this is not only reasonable but optimal for linear simult- 
aneous equation models. This means that the IV estimator is unique for a 
just identified model, because there is only one k-dimensional linear space 
S(W) that can be spanned by the k = l instruments, and, as we saw earlier, 
the IV estimator for a given model depends only on the space spanned by the 
instruments. 


We can always treat an overidentified model as if it were just identified by
choosing exactly $k$ linear combinations of the $l$ columns of $W$. The challenge
is to choose these linear combinations optimally. Formally, we seek an $l \times k$
matrix $J$ such that the $n \times k$ matrix $WJ$ is a valid instrument matrix and
such that the use of $J$ minimizes the asymptotic covariance matrix of the
estimator in the class of IV estimators obtained using an $n \times k$ instrument
matrix of the form $WJ^*$ with arbitrary $l \times k$ matrix $J^*$.


There are three requirements that the matrix J must satisfy. The first of 
these is that it should have full column rank of k. Otherwise, the space 
spanned by the columns of WJ would have rank less than k, and the model 
would be underidentified. The second requirement is that J should be at 
least asymptotically deterministic. If not, it is possible that condition (8.16) 
applied to WJ could fail to hold. The last requirement is that J be chosen 
to minimize the asymptotic covariance matrix of the resulting IV estimator, 
and we now explain how this may be achieved. 


¹ This procedure would not work if, for example, all of the original instruments
were binary variables.

If the explanatory variables X satisfy (8.18), then it follows from (8.17) and
(8.20) that the asymptotic covariance matrix of the IV estimator computed
using WJ as instrument matrix is

$$\sigma_0^2 \operatorname*{plim}_{n\to\infty} \bigl(n^{-1}\bar{X}^\top P_{WJ}\bar{X}\bigr)^{-1}. \tag{8.22}$$

The $t$th row $\bar{X}_t$ of $\bar{X}$ belongs to $\Omega_t$ by construction, and so each element of $\bar{X}_t$
is a deterministic function of the elements of $W_t$. However, the deterministic
functions are not necessarily linear with respect to $W_t$. Thus, in general, it
is impossible to find a matrix $J$ such that $\bar{X} = WJ$, as would be needed for
$WJ$ to constitute a set of truly optimal instruments. A natural second-best
solution is to project $\bar{X}$ orthogonally on to the space $\mathcal{S}(W)$. This yields the
matrix of instruments


$$WJ = P_W\bar{X} = W(W^\top W)^{-1}W^\top\bar{X}, \tag{8.23}$$

which implies that

$$J = (W^\top W)^{-1}W^\top\bar{X}. \tag{8.24}$$


We now show that these instruments are indeed optimal under the constraint
that the instruments should be linear in $W_t$.


By substituting $P_W\bar{X}$ for $WJ$ in (8.22), the asymptotic covariance matrix
becomes

$$\sigma_0^2 \operatorname*{plim}_{n\to\infty} \bigl(n^{-1}\bar{X}^\top P_{P_W\bar{X}}\bar{X}\bigr)^{-1}.$$


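If we write out the projection matrix $P_{P_W\bar{X}}$ explicitly, we find that

$$\bar{X}^\top P_{P_W\bar{X}}\bar{X} = \bar{X}^\top P_W\bar{X}\,(\bar{X}^\top P_W\bar{X})^{-1}\bar{X}^\top P_W\bar{X} = \bar{X}^\top P_W\bar{X}. \tag{8.25}$$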


Thus, the precision matrix for the estimator that uses instruments $P_W\bar{X}$ is
proportional to $\bar{X}^\top P_W\bar{X}$. For the estimator with $WJ$ as instruments, the
precision matrix is proportional to $\bar{X}^\top P_{WJ}\bar{X}$. The difference between the
two precision matrices is therefore proportional to

$$\bar{X}^\top(P_W - P_{WJ})\bar{X}. \tag{8.26}$$


The $k$-dimensional subspace $\mathcal{S}(WJ)$, which is the image of the orthogonal
projection $P_{WJ}$, is a subspace of the $l$-dimensional space $\mathcal{S}(W)$, which is the
image of $P_W$. Thus, by the result in Exercise 2.16, the difference $P_W - P_{WJ}$ is
itself an orthogonal projection matrix. This implies that the difference (8.26)
is a positive semidefinite matrix, and so we can conclude that (8.23) is indeed
the optimal choice of instruments of the form $WJ$.
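The optimality argument can be checked numerically. The following Python sketch is only an illustration, assuming NumPy and randomly generated matrices; the names `proj`, `W`, `Xbar`, and `J` are hypothetical and do not come from the text. It verifies that the difference of the two projections is symmetric and idempotent, that the matrix in (8.26) has no negative eigenvalues, and that the choice (8.24) reproduces equation (8.25).

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix on to the span of the columns of A."""
    return A @ np.linalg.pinv(A)

rng = np.random.default_rng(1)
n, l, k = 40, 5, 2
W = rng.normal(size=(n, l))        # hypothetical instrument matrix
Xbar = rng.normal(size=(n, k))     # stands in for the optimal instruments
J = rng.normal(size=(l, k))        # an arbitrary l x k matrix (full column rank here)

P_W, P_WJ = proj(W), proj(W @ J)
D = P_W - P_WJ                     # difference of the two projections
diff = Xbar.T @ D @ Xbar           # the matrix in (8.26)
eigenvalues = np.linalg.eigvalsh(diff)

# The choice J = (W'W)^{-1} W'Xbar from (8.24) reproduces equation (8.25).
J_opt = np.linalg.solve(W.T @ W, W.T @ Xbar)
lhs_825 = Xbar.T @ proj(W @ J_opt) @ Xbar
rhs_825 = Xbar.T @ P_W @ Xbar
```

Any other full-rank `J` could be substituted; the eigenvalues of `diff` should remain nonnegative up to rounding error.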


At this point, we come up against the same difficulty as that encountered at
the end of Section 6.2, namely, that the optimal instrument choice is infeasible,
because we do not know $\bar{X}$. But notice that, from the definition (8.24) of the
matrix $J$, we have that

$$\operatorname*{plim}_{n\to\infty} J = \operatorname*{plim}_{n\to\infty}\,(n^{-1}W^\top W)^{-1}n^{-1}W^\top\bar{X} = \operatorname*{plim}_{n\to\infty}\,(n^{-1}W^\top W)^{-1}n^{-1}W^\top X, \tag{8.27}$$

by (8.20). This suggests, correctly, that we can use $P_W X$ instead of $P_W\bar{X}$
without changing the asymptotic properties of the estimator.


If we use $P_W X$ as the matrix of instrumental variables, the moment conditions
(8.11) that define the estimator become

$$X^\top P_W(y - X\beta) = 0, \tag{8.28}$$

which can be solved to yield the generalized IV estimator, or GIV estimator,

$$\hat\beta_{\mathrm{IV}} = (X^\top P_W X)^{-1}X^\top P_W y, \tag{8.29}$$


which is sometimes just abbreviated as GIVE. The estimator (8.29) is indeed 
a generalization of the simple estimator (8.12), as readers are asked to verify 
in Exercise 8.3. For this reason, we will usually refer to the IV estimator 
without distinguishing the simple from the generalized case. 
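As a minimal sketch of how (8.29) might be computed, the Python code below assumes NumPy and simulated data; the function names `giv_estimator` and `simple_iv_estimator`, the data-generating values, and the seed are all illustrative assumptions. It also checks the claim that the generalized estimator reduces to the simple one, here taken to be $(W^\top X)^{-1}W^\top y$, when $l = k$.

```python
import numpy as np

def giv_estimator(y, X, W):
    """Generalized IV estimator (X' P_W X)^{-1} X' P_W y, as in (8.29)."""
    # P_W X: fitted values from regressing each column of X on W.
    PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]
    # Solve the moment conditions X' P_W (y - X beta) = 0.
    return np.linalg.solve(PwX.T @ X, PwX.T @ y)

def simple_iv_estimator(y, X, W):
    """Simple IV estimator (W'X)^{-1} W'y, usable only when l = k."""
    return np.linalg.solve(W.T @ X, W.T @ y)

# Hypothetical simulated data: a constant plus one endogenous regressor.
rng = np.random.default_rng(0)
n = 500
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # l = 4 instruments
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)                        # u is correlated with v
x = W @ np.array([0.5, 1.0, 1.0, 1.0]) + v                    # endogenous regressor
X = np.column_stack([np.ones(n), x])                          # k = 2 regressors
y = X @ np.array([1.0, 2.0]) + u

beta_giv = giv_estimator(y, X, W)          # overidentified case, l > k

W2 = W[:, :2]                              # keep only k = 2 instruments
beta_giv_just = giv_estimator(y, X, W2)    # just identified case
beta_simple = simple_iv_estimator(y, X, W2)
```

In the just identified case the two functions should agree up to rounding error, which is the content of Exercise 8.3.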


The generalized IV estimator (8.29) can also be obtained by minimizing the 
IV criterion function, which has many properties in common with the sum 
of squared residuals for models estimated by least squares. This function is 
defined as follows: 


$$Q(\beta, y) = (y - X\beta)^\top P_W(y - X\beta). \tag{8.30}$$

Minimizing $Q(\beta, y)$ with respect to $\beta$ yields the estimator (8.29), as readers
are asked to show in Exercise 8.4.
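The following lines sketch why the minimizer of (8.30) coincides with (8.29). They use only the symmetry and idempotency of $P_W$ and are offered as a guide to Exercise 8.4 rather than as a complete solution.

```latex
\frac{\partial Q(\beta, y)}{\partial \beta}
  = -2\, X^\top P_W (y - X\beta) = 0
\quad\Longrightarrow\quad
X^\top P_W X\, \hat\beta_{\mathrm{IV}} = X^\top P_W y ,
\qquad
\frac{\partial^2 Q(\beta, y)}{\partial \beta\, \partial \beta^\top}
  = 2\, X^\top P_W X .
```

The first-order conditions are the moment conditions (8.28), and the Hessian is positive definite whenever $P_W X$ has full column rank, so that the stationary point is the unique minimum.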


Identifiability and Consistency of the IV Estimator 


In Section 6.2, we defined in (6.12) a $k$-vector $\alpha(\beta)$ of deterministic functions
as the probability limits of the functions used in the moment conditions that
define an estimator, and we saw that the parameter vector $\beta$ is asymptotically
identified if two asymptotic identification conditions are satisfied. The first
condition is that $\alpha(\beta_0) = 0$, and the second is that $\alpha(\beta) \neq 0$ for all $\beta \neq \beta_0$.


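The analogous vector of functions for the IV estimator is

$$\alpha(\beta) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} X^\top P_W(y - X\beta) = S_{X^\top W}(S_{W^\top W})^{-1} \operatorname*{plim}_{n\to\infty} \frac{1}{n} W^\top(y - X\beta), \tag{8.31}$$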


where $S_{X^\top W} \equiv S_{W^\top X}^\top$, which was defined in (8.14), and $S_{W^\top W}$ was
defined just after (8.17). For asymptotic identification, we assume that both
these matrices exist and have full rank. This assumption is analogous to the
assumption that $1/n$ times the matrix $X^\top X$ has probability limit $S_{X^\top X}$, a
matrix with full rank, which we originally made in Section 3.3 when we proved
that the OLS estimator is consistent. If $S_{W^\top W}$ does not have full rank, then
at least one of the instruments is perfectly collinear with the others,
asymptotically, and should therefore be dropped. If $S_{W^\top X}$ does not have full rank,
then the asymptotic version of the moment conditions (8.28) has fewer than $k$
linearly independent equations, and these conditions therefore have no unique
solution.


If $\beta_0$ is the true parameter vector, then $y - X\beta_0 = u$, and the right-hand side
of (8.31) vanishes under the assumption (8.16) used to show the consistency
of the simple IV estimator. Thus $\alpha(\beta_0) = 0$, and the first condition for
asymptotic identification is satisfied.


The second condition requires that $\alpha(\beta) \neq 0$ for all $\beta \neq \beta_0$. It is easy to see
from (8.31) that

$$\alpha(\beta) = S_{X^\top W}(S_{W^\top W})^{-1}S_{W^\top X}(\beta_0 - \beta).$$

For this to be nonzero for all nonzero $\beta_0 - \beta$, it is necessary and sufficient
that the matrix $S_{X^\top W}(S_{W^\top W})^{-1}S_{W^\top X}$ should have full rank $k$. This will
be the case if the matrices $S_{W^\top W}$ and $S_{W^\top X}$ both have full rank, as we
have assumed. If $l = k$, the conditions on the two matrices $S_{W^\top W}$ and
$S_{W^\top X}$ simplify, as we saw when considering the simple IV estimator, to the
single condition (8.14). The condition that $S_{X^\top W}(S_{W^\top W})^{-1}S_{W^\top X}$ has full
rank can also be used to show that the probability limit of $1/n$ times the IV
criterion function (8.30) has a unique global minimum at $\beta = \beta_0$, as readers
are asked to show in Exercise 8.5.


The two asymptotic identification conditions are sufficient for consistency. 
Because we are dealing here with linear models, there is no need for a sophis- 
ticated proof of this fact; see Exercise 8.6. The key assumption is, of course, 
(8.16). If this assumption did not hold, because any of the instruments was 
asymptotically correlated with the error terms, the first of the asymptotic 
identification conditions would not hold either, and the IV estimator would 
not be consistent. 


Asymptotic Distribution of the IV Estimator 


Like every estimator that we have studied, the IV estimator is asymptot- 
ically normally distributed with an asymptotic covariance matrix that can 
be estimated consistently. The asymptotic covariance matrix for the simple 
IV estimator, expression (8.17), turns out to be valid for the generalized IV 
estimator as well. To see this, we replace $W$ in (8.17) by the asymptotically
optimal instruments $P_W X$. As in (8.25), we find that

$$X^\top P_{P_W X}X = X^\top P_W X\,(X^\top P_W X)^{-1}X^\top P_W X = X^\top P_W X,$$

from which it follows that (8.17) is unchanged if $W$ is replaced by $P_W X$.


It can also be shown directly that (8.17) is the asymptotic covariance matrix
of the generalized IV estimator. From (8.29), it follows that

$$n^{1/2}(\hat\beta_{\mathrm{IV}} - \beta_0) = (n^{-1}X^\top P_W X)^{-1}\,n^{-1/2}X^\top P_W u. \tag{8.32}$$

Under reasonable assumptions, a central limit theorem can be applied to
the expression $n^{-1/2}W^\top u$, which allows us to conclude that the asymptotic
distribution of this expression is multivariate normal, with mean zero and
covariance matrix

$$\lim_{n\to\infty} \frac{1}{n} W^\top E(uu^\top)W = \sigma_0^2 S_{W^\top W}, \tag{8.33}$$

since we assume that $E(uu^\top) = \sigma_0^2 I$. With this result, it can be shown quite
simply that (8.17) is the asymptotic covariance matrix of $\hat\beta_{\mathrm{IV}}$; see Exercise 8.7.


In practice, since $\sigma_0^2$ is unknown, we use

$$\widehat{\mathrm{Var}}(\hat\beta_{\mathrm{IV}}) = \hat\sigma^2(X^\top P_W X)^{-1} \tag{8.34}$$

to estimate the covariance matrix of $\hat\beta_{\mathrm{IV}}$. Here $\hat\sigma^2$ is $1/n$ times the sum of the
squares of the components of the residual vector $y - X\hat\beta_{\mathrm{IV}}$. In contrast to the
OLS case, there is no good reason to divide by anything other than $n$ when
estimating $\sigma^2$. Because IV estimation minimizes the IV criterion function and
not the sum of squared residuals, IV residuals are not necessarily too small.
Nevertheless, many regression packages divide by $n - k$ instead of by $n$.
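The covariance matrix estimate (8.34) can be sketched in a few lines. The Python code below assumes NumPy and simulated data; the function name `giv_with_covariance` and all numerical values are illustrative assumptions. Note that the residuals are computed with $X$ itself and that the divisor is $n$, as in the text.

```python
import numpy as np

def giv_with_covariance(y, X, W):
    """Return the GIV estimates, sigma-hat squared, and the estimate (8.34)."""
    PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]   # P_W X
    XPX = PwX.T @ X                                   # X' P_W X
    beta = np.linalg.solve(XPX, PwX.T @ y)
    resid = y - X @ beta                              # IV residuals use X, not P_W X
    sigma2 = resid @ resid / len(y)                   # divide by n, as in the text
    return beta, sigma2, sigma2 * np.linalg.inv(XPX)

# Hypothetical simulated data with error variance equal to one.
rng = np.random.default_rng(2)
n = 2000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)               # Var(u) = 0.64 + 0.36 = 1
X = np.column_stack([np.ones(n), W[:, 1:].sum(axis=1) + v])
y = X @ np.array([1.0, 2.0]) + u

beta_hat, sigma2_hat, cov_hat = giv_with_covariance(y, X, W)
std_errors = np.sqrt(np.diag(cov_hat))               # asymptotic standard errors
```

A package that divides by $n - k$ would report standard errors larger by the factor $(n/(n-k))^{1/2}$, which is negligible in a sample of this size.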


The choice of instruments will usually affect the asymptotic covariance matrix
of the IV estimator. If some or all of the columns of $\bar{X}$ are not contained in
the span $\mathcal{S}(W)$ of the instruments, an efficiency gain is potentially available
if that span is made larger. Readers are asked in Exercise 8.8 to demonstrate
formally that adding an extra instrument by appending a new column to $W$
will, in general, reduce the asymptotic covariance matrix. Of course, it cannot
be made smaller than the lower bound $\sigma_0^2(\bar{X}^\top\bar{X})^{-1}$, which is attained if the
optimal instruments $\bar{X}$ are available.

When all the regressors can validly be used as instruments, we have $\bar{X} = X$,
and the efficient IV estimator coincides with the OLS estimator, as the
Gauss-Markov Theorem predicts.


Two-Stage Least Squares 


The IV estimator (8.29) is commonly known as the two-stage least squares,
or 2SLS, estimator, because, before the days of good econometrics software
packages, it was often calculated in two stages using OLS regressions. In the
first stage, each column $x_i$, $i = 1, \ldots, k$, of $X$ is regressed on $W$, if necessary.
If a regressor $x_i$ is a valid instrument, it is already (or should be) one of the
columns of $W$. In that case, since $P_W x_i = x_i$, no first-stage regression is
needed, and we say that such a regressor serves as its own instrument.


The fitted values from the first-stage regressions, plus the actual values of
any regressors that serve as their own instruments, are collected to form the
matrix $P_W X$. Then the second-stage regression,

$$y = P_W X\beta + u, \tag{8.35}$$

is used to obtain the 2SLS estimates. Because $P_W$ is an idempotent matrix,
the OLS estimate of $\beta$ from this second-stage regression is

$$\hat\beta_{\mathrm{2sls}} = (X^\top P_W X)^{-1}X^\top P_W y,$$

which is identical to (8.29), the generalized IV estimator $\hat\beta_{\mathrm{IV}}$.


If this two-stage procedure is used, some care must be taken when estimating
the standard error of the regression and the covariance matrix of the parameter
estimates. The OLS estimate of $\sigma^2$ from regression (8.35) is

$$s^2 = \frac{\|y - P_W X\hat\beta_{\mathrm{IV}}\|^2}{n - k}. \tag{8.36}$$

In contrast, the estimate that was used in the estimated IV covariance matrix
(8.34) is

$$\hat\sigma^2 = \frac{\|y - X\hat\beta_{\mathrm{IV}}\|^2}{n}. \tag{8.37}$$

These two estimates of $\sigma^2$ are not asymptotically equivalent, and $s^2$ is not
consistent. The reason is that the residuals from regression (8.35) do not
tend to the corresponding error terms as $n \to \infty$, because the regressors in
(8.35) are not the true explanatory variables. Therefore, $1/(n - k)$ times the
sum of squared residuals is not a consistent estimator of $\sigma^2$. Of course, no
regression package providing IV or 2SLS estimation would ever use (8.36)
to estimate $\sigma^2$. Instead, it would use (8.37), or at least something that is
asymptotically equivalent to it.
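The difference between (8.36) and (8.37) is easy to see in a simulation. The sketch below assumes NumPy and a hypothetical data-generating process in which the true error variance is one; the variable names are illustrative. It carries out the two stages explicitly, confirms that the second-stage coefficients match the direct formula (8.29), and computes both variance estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # instruments
v = rng.normal(size=n)
u = 0.9 * v + np.sqrt(1.0 - 0.81) * rng.normal(size=n)        # Var(u) = 1
x = W @ np.array([0.0, 1.0, 1.0]) + v                         # endogenous regressor
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u
k = X.shape[1]

# Stage 1: regress each column of X on W and keep the fitted values P_W X.
PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]
# Stage 2: regress y on P_W X by OLS, as in (8.35).
beta_2sls = np.linalg.lstsq(PwX, y, rcond=None)[0]
# Direct computation from (8.29) for comparison.
beta_iv = np.linalg.solve(PwX.T @ X, PwX.T @ y)

# (8.36): second-stage OLS residuals, which use P_W X in place of X.
s2_second_stage = np.sum((y - PwX @ beta_2sls) ** 2) / (n - k)
# (8.37): IV residuals, which use X itself.
sigma2_hat = np.sum((y - X @ beta_2sls) ** 2) / n
```

In this design the second-stage residuals contain the component of the endogenous regressor that the instruments do not explain, multiplied by its coefficient, so `s2_second_stage` should come out several times larger than `sigma2_hat`.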


Two-stage least squares was invented by Theil (1953) and Basmann (1957) 
at a time when computers were very primitive. Consequently, despite the 
classic papers of Durbin (1954) and Sargan (1958) on instrumental variables 
estimation, the term “two-stage least squares” came to be very widely used 
in econometrics, even when the estimator is not actually computed in two 
stages. We prefer to think of two-stage least squares as simply a particular
way to compute the generalized IV estimator, and we will use $\hat\beta_{\mathrm{IV}}$ rather than
$\hat\beta_{\mathrm{2sls}}$ to denote that estimator.


8.4 Finite-Sample Properties of IV Estimators 


Unfortunately, the finite-sample distributions of IV estimators are much more 
complicated than the asymptotic ones. Indeed, except in very special cases, 
these distributions are unknowable in practice. Although it is consistent, the 
IV estimator for just identified models has a distribution with such thick tails 
that its expectation does not even exist. With overidentified models, the 
expectation of the estimator exists, but it is in general different from the true 
parameter value, so that the estimator is biased, often very substantially so. 
In consequence, investigators can easily make serious errors of inference when
interpreting IV estimates.

The biases in the OLS estimates of a model like (8.10) arise because the
error terms are correlated with some of the regressors. The IV estimator
solves this problem asymptotically, because the projections of the regressors
on to $\mathcal{S}(W)$ are asymptotically uncorrelated with the error terms. However,
there will always still be some correlation in finite samples, and this causes
the IV estimator to be biased.


Systems of Equations 


In order to understand the finite-sample properties of the IV estimator, we
need to consider the model (8.10) as part of a system of equations. We
therefore change notation somewhat and rewrite (8.10) as

$$y = Z\beta_1 + Y\beta_2 + u, \qquad E(uu^\top) = \sigma^2 I, \tag{8.38}$$

where the matrix of regressors $X$ has been partitioned into two parts, namely,
an $n \times k_1$ matrix of exogenous and predetermined variables, $Z$, and an $n \times k_2$
matrix of endogenous variables, $Y$, and the vector $\beta$ has been partitioned
conformably into two subvectors $\beta_1$ and $\beta_2$. There are assumed to be $l \ge k$
instruments, of which $k_1$ are the columns of the matrix $Z$.


The model (8.38) is not fully specified, because it says nothing about how the
matrix $Y$ is generated. For each observation $t$, $t = 1, \ldots, n$, the value $y_t$ of
the dependent variable and the values $Y_t$ of the other endogenous variables
are assumed to be determined by a set of linear simultaneous equations. The
variables in the matrix $Y$ are called current endogenous variables, because
they are determined simultaneously, row by row, along with $y$. Suppose that
all the exogenous and predetermined explanatory variables in the full set of
simultaneous equations are included in the $n \times l$ instrument matrix $W$, of
which the first $k_1$ columns are those of $Z$. Then, as can easily be seen by
analogy with the explicit result (8.09) for the demand-supply model, we have
for each endogenous variable $y_i$, $i = 0, 1, \ldots, k_2$, that

$$y_i = W\pi_i + v_i, \qquad E(v_i \mid W) = 0. \tag{8.39}$$

Here $y_0 \equiv y$, and the $y_i$, for $i = 1, \ldots, k_2$, are the columns of $Y$. The $\pi_i$
are $l$-vectors of unknown coefficients, and the $v_i$ are $n$-vectors of error terms
that are innovations with respect to the instruments.


Equations like (8.39), which have only exogenous and predetermined variables 
on the right-hand side, are called reduced form equations, in contrast with 
equations like (8.38), which are called structural equations. Writing a model 
as a set of reduced form equations emphasizes the fact that all the endogenous 
variables are generated by similar mechanisms. In general, the error terms for
the various reduced form equations will display contemporaneous correlation:
If $v_{ti}$ denotes a typical element of the vector $v_i$, then, for observation $t$, the
reduced form error terms $v_{ti}$ will generally be correlated among themselves
and correlated with the error term $u_t$ of the structural equation.


A Simple Example


In order to gain additional intuition about the properties of the IV estimator in
finite samples, we consider the very simplest nontrivial example, in which the
dependent variable $y$ is explained by only one variable, which we denote by $x$.
The regressor $x$ is endogenous, and there is available exactly one exogenous
instrument, $w$. In order to keep the example reasonably simple, we suppose
that all the error terms, for both $y$ and $x$, are normally distributed. Thus the
DGP that simultaneously determines $x$ and $y$ can be written as

$$y = x\beta_0 + \sigma_u u, \qquad x = w\pi_0 + \sigma_v v, \tag{8.40}$$

analogously to (8.39). By explicitly writing $\sigma_u$ and $\sigma_v$ as the standard
deviations of the error terms, we can define the vectors $u$ and $v$ to be multivariate
standard normal, that is, distributed as $N(0, I)$. There is contemporaneous
correlation of $u$ and $v$, so that we have $E(u_t v_t) = \rho$, for some correlation
coefficient $\rho$ such that $-1 < \rho < 1$. The result of Exercise 4.4 shows that the
expectation of $u_t$ conditional on $v_t$ is $\rho v_t$, and so we can write $u = \rho v + u_1$,
where $u_1$ has mean zero conditional on $v$.


In this simple, just identified, setup, the IV estimator of the parameter $\beta$ is

$$\hat\beta_{\mathrm{IV}} = (w^\top x)^{-1}w^\top y = \beta_0 + \sigma_u(w^\top x)^{-1}w^\top u. \tag{8.41}$$

This expression is clearly unchanged if the instrument $w$ is multiplied by an
arbitrary scalar, and so we can, without loss of generality, rescale $w$ so that
$w^\top w = 1$. Then, using the second equation in (8.40), we find that

$$\hat\beta_{\mathrm{IV}} - \beta_0 = \frac{\sigma_u w^\top u}{\pi_0 + \sigma_v w^\top v} = \frac{\sigma_u w^\top(\rho v + u_1)}{\pi_0 + \sigma_v w^\top v}.$$

Let us now compute the expectation of this expression conditional on $v$. Since,
by construction, $E(u_1 \mid v) = 0$, we obtain

$$E(\hat\beta_{\mathrm{IV}} - \beta_0 \mid v) = \frac{\rho\sigma_u}{\sigma_v}\,\frac{z}{a + z}, \tag{8.42}$$

where we have made the definitions $a \equiv \pi_0/\sigma_v$ and $z \equiv w^\top v$. Given our
rescaling of $w$, it is easy to see that $z \sim N(0, 1)$.


If $\rho = 0$, the right-hand side of (8.42) vanishes, and so the unconditional
expectation of $\hat\beta_{\mathrm{IV}} - \beta_0$ vanishes as well. Therefore, in this special case, $\hat\beta_{\mathrm{IV}}$
is unbiased. This is as expected, since, if $\rho = 0$, the regressor $x$ is uncorrelated
with the error vector $u$. If $\rho \neq 0$, however, (8.42) is equal to a nonzero factor
times the random variable $z/(a + z)$. Unless $a = 0$, it turns out that this
random variable has no expectation. To see this, we can try to calculate it.
If it existed, it would be

$$E\Bigl(\frac{z}{a + z}\Bigr) = \int_{-\infty}^{\infty} \frac{z}{a + z}\,\phi(z)\,dz, \tag{8.43}$$

where, as usual, $\phi(\cdot)$ is the density of the standard normal distribution. It is
a fairly simple calculus exercise to show that the integral in (8.43) diverges in
the neighborhood of $z = -a$.
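The calculus exercise can be sketched as follows, for $a \neq 0$. Near $z = -a$ the numerator of the integrand is approximately constant, so that only the factor $1/(a + z)$ matters; here $\varepsilon$ and $\delta$ are small positive constants introduced purely for illustration.

```latex
z\,\phi(z) \;\longrightarrow\; -a\,\phi(a) \neq 0
\quad\text{as } z \to -a,
\qquad
\int_{-a+\delta}^{-a+\varepsilon} \frac{dz}{a + z}
  = \log\frac{\varepsilon}{\delta}
  \;\longrightarrow\; \infty
\quad\text{as } \delta \to 0^{+}.
```

The integral of the absolute value of the integrand over any interval containing $-a$ therefore diverges logarithmically, which is why the expectation cannot exist.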


If $\pi_0 = 0$, then $a = 0$. In this rather odd case, $x = \sigma_v v$ is just noise, as though
it were an error term. Therefore, since $z/(a + z)$ reduces to 1, the expectation
exists, but it is not zero, and $\hat\beta_{\mathrm{IV}}$ is therefore biased.


When $a \neq 0$, which is the usual case, the IV estimator (8.41) is neither biased
nor unbiased, because it has no expectation for any finite sample size $n$. This
may seem to contradict the result according to which $\hat\beta_{\mathrm{IV}}$ is asymptotically
normal, since all the moments of the normal distribution exist. However, 
the fact that a sequence of random variables converges to a limiting ran- 
dom variable does not necessarily imply that the moments of the variables 
in the sequence converge to those of the limiting variable; see Davidson and 
MacKinnon (1993, Section 4.5). The estimator (8.41) is a case in point. For- 
tunately, this possible failure to converge of the moments does not extend to 
the CDFs of the random variables, which do indeed converge to that of the 
limit. Consequently, P values and the upper and lower limits of confidence 
intervals computed with the asymptotic distribution are legitimate approxi- 
mations, in the sense that they become more and more accurate as the sample 
size increases. 
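The absence of moments shows up clearly in simulated draws. The Python sketch below assumes NumPy and sets $\sigma_u = \sigma_v = 1$, so that, with $w^\top w = 1$, the estimation error is $w^\top u/(a + z)$ with $w^\top u = \rho z + (1 - \rho^2)^{1/2}e$ and $e$ standard normal and independent of $z$. The function name, the parameter values, and the seed are illustrative assumptions.

```python
import numpy as np

def simulate_iv_errors(a, rho, n_draws, seed=0):
    """Draws of beta_IV - beta_0 in the one-regressor, one-instrument example,
    under the assumption sigma_u = sigma_v = 1 and w'w = 1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_draws)               # z = w'v
    e = rng.standard_normal(n_draws)               # standardized w'u_1
    wu = rho * z + np.sqrt(1.0 - rho**2) * e       # w'u, correlated with z
    return wu / (a + z)                            # denominator is pi_0 + sigma_v z

draws = simulate_iv_errors(a=2.0, rho=0.9, n_draws=200_000)
q25, q50, q75 = np.quantile(draws, [0.25, 0.5, 0.75])
iqr = q75 - q25                                    # a robust measure of spread
largest = np.max(np.abs(draws))                    # dominated by draws with z near -a
```

Quantiles such as the median and the interquartile range are stable across seeds, whereas the sample mean and `largest` are driven by the few draws for which $a + z$ is close to zero. This is the practical meaning of a distribution whose CDF is well behaved but whose expectation does not exist.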


A less simple calculation can be used to show that, in the overidentified case,
the first $l - k$ moments of $\hat\beta_{\mathrm{IV}}$ exist; see Kinal (1980). This is consistent
with the result we have just obtained for an exactly identified model, where
$l - k = 0$, and the IV estimator has no moments at all. When the mean of
$\hat\beta_{\mathrm{IV}}$ exists, it is almost never equal to $\beta_0$. Readers will have a much clearer
idea of the impact of the existence or nonexistence of moments, and of the 
bias of the IV estimator, if they work carefully through Exercises 8.10 to 8.13, 
in which they are asked to generate by simulation the EDFs of the estimator 
in different situations. 


The General Case 


We now return to the general case, in which the structural equation (8.38)
is being estimated, and the other endogenous variables are generated by the
reduced form equations (8.39) for $i = 1, \ldots, k_2$, which correspond to the
first-stage regressions for 2SLS. We can group the vectors of fitted values from
these regressions into an $n \times k_2$ matrix $P_W Y$. The generalized IV estimator
is then equivalent to a simple IV estimator that uses the instruments
$P_W X = [Z \;\; P_W Y]$. By grouping the $l$-vectors $\pi_i$, $i = 1, \ldots, k_2$, into an
$l \times k_2$ matrix $\Pi_2$ and the vectors of error terms $v_i$ into an $n \times k_2$ matrix $V_2$,
we see that

$$P_W X = [Z \;\; P_W Y] = [Z \;\; P_W(W\Pi_2 + V_2)] = [Z \;\; W\Pi_2 + P_W V_2] = W\Pi + P_W V. \tag{8.44}$$

Here $V$ is an $n \times k$ matrix of the form $[O \;\; V_2]$, where the zero block has
dimension $n \times k_1$, and $\Pi$ is an $l \times k$ matrix, which can be written as
$\Pi = [\Pi_1 \;\; \Pi_2]$, where the $l \times k_1$ matrix $\Pi_1$ is a $k_1 \times k_1$ identity matrix sitting on
top of an $(l - k_1) \times k_1$ zero matrix. It is easily checked that these definitions
make the last equality in (8.44) correct. Thus $P_W X$ has two components:
$W\Pi$, which by assumption is uncorrelated with $u$, and $P_W V$, which will
almost always be correlated with $u$.


If we substitute the rightmost expression of (8.44) into (8.32), eliminating the
factors of powers of $n$, which are unnecessary in the finite-sample context, we
find that

$$\hat\beta_{\mathrm{IV}} - \beta_0 = (\Pi^\top W^\top W\Pi + \Pi^\top W^\top V + V^\top W\Pi + V^\top P_W V)^{-1}(\Pi^\top W^\top u + V^\top P_W u). \tag{8.45}$$

To make sense of this rather messy expression, first set $V = O$. The result is

$$\hat\beta_{\mathrm{IV}} - \beta_0 = (\Pi^\top W^\top W\Pi)^{-1}\Pi^\top W^\top u. \tag{8.46}$$

If $V = O$, the supposedly endogenous variables $Y$ are in fact exogenous or
predetermined, and it can be checked (see Exercise 8.14) that, in this case,
$\hat\beta_{\mathrm{IV}}$ is just the OLS estimator for model (8.10).


If $V$ is not zero, but is independent of $u$, then we see immediately that the
expectation of (8.45) conditional on $V$ is zero. This case is the analog of the
case with $\rho = 0$ in (8.42). Note that we require the full independence of $V$
and u for this to hold. If instead V were just predetermined with respect 
to u, the IV estimator would still have a finite-sample bias, for exactly the 
same reasons as those leading to finite-sample bias of the OLS estimator with 
predetermined but not exogenous explanatory variables. 


When $V$ and $u$ are contemporaneously correlated, it can be shown that all
the terms in (8.45) which involve $V$ do not contribute asymptotically; see
Exercise 8.15. Thus we can see that any discrepancy between the finite-sample
and asymptotic distributions of $\hat\beta_{\mathrm{IV}} - \beta_0$ must arise from the terms
in (8.45) that involve $V$. In fact, in the absence of other features of the model
that could give rise to finite-sample bias, such as lagged dependent variables,
the poor finite-sample properties of the IV estimator arise solely from the
contemporaneous correlation between $P_W V$ and $u$. In particular, the second
term in the second factor of (8.45) will generally have a nonzero mean, and
this term can be a major source of bias when the correlation between $u$ and
some of the columns of $V$ is high.


If the terms involving V in (8.45) are relatively small, the finite-sample distri- 
bution of the IV estimator is likely to be well approximated by its asymptotic 
distribution. However, if these terms are not small, the asymptotic
approximation may be poor. Thus our analysis suggests that there are three situations
in which the IV estimator is likely to have poor finite-sample properties.

• When $l$, the number of instruments, is large, $W$ will be able to explain
  much of the variation in $V$; recall from Section 3.8 that adding additional
  regressors can never reduce the $R^2$ of a regression. With large $l$,
  consequently, $P_W V$ will be relatively large. When the number of instruments
  is extremely large relative to the sample size, the first-stage regressions
  may fit so well that $P_W Y$ is very similar to $Y$. In this situation, the
  IV estimates may be almost as biased as the OLS ones.


• When at least some of the reduced-form regressions (8.39) fit poorly,
  in the sense that the $R^2$ is small or the $F$ statistic for all the slope
  coefficients to be zero is insignificant, the model is said to suffer from
  weak instruments. In this situation, even if $P_W V$ is no larger than usual,
  it may nevertheless be large relative to $W\Pi$. When the instruments are
  very weak, the finite-sample distribution of the IV estimator may be very
  far from its asymptotic distribution even in samples with many thousands
  of observations. An example of this is furnished by the case in which $a = 0$
  in (8.42) in our simple example with one regressor and one instrument.
  As we saw, the distribution of the estimator is quite different when $a = 0$
  from what it is when $a \neq 0$; the distribution when $a \approx 0$ may well be
  similar to the distribution when $a = 0$.


• When the correlation between $u$ and some of the columns of $V$ is very
  high, $V^\top P_W u$ will tend to be relatively large. Whether it will be large
  enough to cause serious problems for inference will depend on the sample
  size, the number of instruments, and how well the instruments explain
  the endogenous variables.
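The diagnostic mentioned in the second situation, the $R^2$ and the $F$ statistic of a reduced-form regression, can be sketched as follows. The code assumes NumPy, assumes that the first column of the instrument matrix is a constant, and uses simulated data; the function name `first_stage_F` and the coefficient values are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def first_stage_F(x, W):
    """R^2 and F statistic for the hypothesis that all slope coefficients are
    zero in a regression of x on W (W must contain a constant as first column)."""
    n, l = W.shape
    fitted = W @ np.linalg.lstsq(W, x, rcond=None)[0]
    ssr = np.sum((x - fitted) ** 2)                 # unrestricted SSR
    tss = np.sum((x - x.mean()) ** 2)               # SSR with a constant only
    r2 = 1.0 - ssr / tss
    F = ((tss - ssr) / (l - 1)) / (ssr / (n - l))
    return r2, F

# Hypothetical designs: the same noise, with strong and very weak instruments.
rng = np.random.default_rng(4)
n = 500
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
v = rng.normal(size=n)
x_strong = W @ np.array([0.0, 1.0, 1.0, 1.0]) + v
x_weak = W @ np.array([0.0, 0.01, 0.01, 0.01]) + v

r2_strong, F_strong = first_stage_F(x_strong, W)
r2_weak, F_weak = first_stage_F(x_weak, W)
```

When some instruments are also regressors in the structural equation, a more informative variant would test only the coefficients on the excluded instruments; the version above follows the wording of the text.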


It may seem that adding additional instruments will always increase the finite- 
sample bias of the IV estimator, and Exercise 8.13 illustrates a case in which 
it does. In that case, the additional instruments do not really belong in the 
reduced-form regressions. However, if the instruments truly belong in the 
reduced-form regressions, adding them will alleviate the weak instruments 
problem, and that can actually cause the bias to diminish. 


Finite-sample inference in models estimated by instrumental variables is a 
subject of active research in econometrics. Relatively recent papers on this 
topic include Nelson and Startz (1990a, 1990b), Buse (1992), Bekker (1994), 
Bound, Jaeger, and Baker (1995), Dufour (1997), Staiger and Stock (1997), 
Wang and Zivot (1998), Zivot, Startz, and Nelson (1998), Angrist, Imbens, 
and Krueger (1999), Blomquist and Dahlberg (1999), Donald and Newey 
(2001), Hahn and Hausman (2002), Kleibergen (2002), and Stock, Wright, 
and Yogo (2002). There remain many unsolved problems.


8.5 Hypothesis Testing


Because the finite-sample distributions of IV estimators are almost never 
known, exact tests of hypotheses based on such estimators are almost never 
available. However, large-sample tests can be performed in a variety of ways. 
Since many of the methods of performing these tests are very similar to meth- 
ods that we have already discussed in Chapters 4 and 6, there is no need to 
discuss them in detail. 


Asymptotic t and Wald Statistics 


When there is just one restriction, the easiest approach is simply to compute
an asymptotic $t$ test. For example, if we wish to test the hypothesis that
$\beta_i = \beta_{i0}$, where $\beta_i$ is one of the regression parameters, then a suitable test
statistic is

$$t_{\beta_i} = \frac{\hat\beta_i - \beta_{i0}}{\bigl(\widehat{\mathrm{Var}}(\hat\beta_i)\bigr)^{1/2}}, \tag{8.47}$$

where $\hat\beta_i$ is the IV estimate of $\beta_i$, and $\widehat{\mathrm{Var}}(\hat\beta_i)$ is the $i$th diagonal element
of the estimated covariance matrix, (8.34). This test statistic will not follow
the Student's $t$ distribution in finite samples, but it will be asymptotically
distributed as $N(0, 1)$ under the null hypothesis.


For testing restrictions on two or more parameters, the natural analog of
(8.47) is a Wald statistic. Suppose that $\beta$ is partitioned as $[\beta_1 \;\; \beta_2]$, and we
wish to test the hypothesis that $\beta_2 = \beta_{20}$. Then, as in (6.71), the appropriate
Wald statistic is

$$W_{\beta_2} = (\hat\beta_2 - \beta_{20})^\top\bigl(\widehat{\mathrm{Var}}(\hat\beta_2)\bigr)^{-1}(\hat\beta_2 - \beta_{20}), \tag{8.48}$$

where $\widehat{\mathrm{Var}}(\hat\beta_2)$ is the submatrix of (8.34) that corresponds to the vector $\beta_2$.
This Wald statistic can be thought of as a generalization of the asymptotic $t$
statistic: When $\beta_2$ is a scalar, the square root of (8.48) is (8.47).
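The two statistics can be computed from the IV estimates and the covariance matrix estimate (8.34), as in the following sketch. It assumes NumPy and simulated data with one endogenous regressor and one exogenous regressor that serves as its own instrument; the function names `iv_fit`, `asymptotic_t`, and `wald`, and all numerical values, are illustrative assumptions.

```python
import numpy as np

def iv_fit(y, X, W):
    """GIV estimates and the covariance matrix estimate (8.34)."""
    PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]
    XPX = PwX.T @ X
    beta = np.linalg.solve(XPX, PwX.T @ y)
    resid = y - X @ beta
    return beta, (resid @ resid / len(y)) * np.linalg.inv(XPX)

def asymptotic_t(beta, cov, i, beta_i0):
    """Asymptotic t statistic (8.47) for the hypothesis beta_i = beta_i0."""
    return (beta[i] - beta_i0) / np.sqrt(cov[i, i])

def wald(beta, cov, idx, beta_20):
    """Wald statistic (8.48) for the hypothesis beta[idx] = beta_20."""
    d = beta[idx] - beta_20
    V22 = cov[np.ix_(idx, idx)]          # submatrix of (8.34) for the tested block
    return d @ np.linalg.solve(V22, d)

rng = np.random.default_rng(7)
n = 1000
x2 = rng.normal(size=n)                                   # exogenous regressor
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2)), x2])
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)
x1 = W[:, 1] + W[:, 2] + v                                # endogenous regressor
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, 0.0]) + u

beta_hat, cov_hat = iv_fit(y, X, W)
t_stat = asymptotic_t(beta_hat, cov_hat, 2, 0.0)          # H0: coefficient on x2 is 0
wald_single = wald(beta_hat, cov_hat, [2], np.array([0.0]))
wald_joint = wald(beta_hat, cov_hat, [1, 2], np.array([2.0, 0.0]))
```

With a single restriction, `wald_single` equals the square of `t_stat`, in line with the remark that (8.48) generalizes (8.47). Under the null, `wald_joint` would be compared with critical values of the chi-squared distribution with two degrees of freedom.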


The IV Variant of the GNR 


In many circumstances, the easiest way to obtain asymptotically valid test
statistics for models estimated using instrumental variables is to use a variant
of the Gauss-Newton regression. For the model (8.10), this variant, called the
IVGNR, takes the form

$$y - X\beta = P_W Xb + \text{residuals}. \tag{8.49}$$

As with the usual GNR, the variables of the IVGNR must be evaluated at
some prespecified value of $\beta$ before the regression can be run, in the usual
way, using ordinary least squares.


The IVGNR has the same properties relative to model (8.10) as the ordinary
GNR has relative to linear and nonlinear regression models estimated by least
squares. The first property is that, if (8.49) is evaluated at $\beta = \hat\beta_{\mathrm{IV}}$, then the
regressors $P_W X$ are orthogonal to the regressand, because the orthogonality
conditions, namely,

$$X^\top P_W(y - X\hat\beta_{\mathrm{IV}}) = 0,$$

are just the moment conditions (8.28) that define $\hat\beta_{\mathrm{IV}}$.


The second property is that, if (8.49) is again evaluated at $\beta = \hat\beta_{\mathrm{IV}}$, the
estimated OLS covariance matrix is asymptotically valid. This matrix is

$$\hat{s}^2(X^\top P_W X)^{-1}. \tag{8.50}$$

Here $\hat{s}^2$ is the sum of squared residuals from (8.49), divided by $n - k$. Since
$b = 0$ because of the orthogonality of the regressand and the regressors, those
residuals are the components of the vector $y - X\hat\beta_{\mathrm{IV}}$, that is, the IV residuals
from (8.10). It follows that (8.50), which has exactly the same form as (8.34),
is a consistent estimator of the covariance matrix of $\hat\beta_{\mathrm{IV}}$, where "consistent
estimator" is used in the sense of (5.22). As with the ordinary GNR, the
estimator $\acute{s}^2$ obtained by running (8.49) with $\beta = \acute\beta$ is consistent for the error
variance $\sigma^2$ if $\acute\beta$ is root-$n$ consistent; see Exercise 8.16.

The third property is that, like the ordinary GNR, the IVGNR permits
one-step efficient estimation. For linear models, this is true if any value of $\beta$
is used in (8.49). If we set $\beta = \acute\beta$, then running (8.49) gives the artificial
parameter estimates

$$\acute{b} = (X^\top P_W X)^{-1}X^\top P_W(y - X\acute\beta) = \hat\beta_{\mathrm{IV}} - \acute\beta,$$

from which it follows that $\acute\beta + \acute{b} = \hat\beta_{\mathrm{IV}}$ for all $\acute\beta$. In the context of nonlinear
IV estimation (see Section 8.9), this result, like the one above for $\acute{s}^2$, becomes
an approximation that is asymptotically valid only if $\acute\beta$ is a root-$n$ consistent
estimator of the true $\beta_0$.
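The first and third properties are easy to confirm numerically for a linear model. The sketch below assumes NumPy and simulated data; the helper `ivgnr`, the starting value `beta_start`, and the data-generating values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # l = 4 instruments
v = rng.normal(size=n)
u = 0.7 * v + rng.normal(size=n)                               # errors correlated with v
X = np.column_stack([np.ones(n), W[:, 1:] @ np.ones(3) + v])   # k = 2 regressors
y = X @ np.array([1.0, 2.0]) + u

PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]                 # regressors of the IVGNR
beta_iv = np.linalg.solve(PwX.T @ X, PwX.T @ y)                # estimator (8.29)

def ivgnr(beta):
    """OLS coefficients b from regressing y - X beta on P_W X, as in (8.49)."""
    return np.linalg.lstsq(PwX, y - X @ beta, rcond=None)[0]

b_at_iv = ivgnr(beta_iv)                    # first property: b = 0 at the IV estimates
beta_start = np.array([-3.0, 10.0])         # an arbitrary starting value
one_step = beta_start + ivgnr(beta_start)   # third property: one step reaches beta_iv
```

Because the model is linear, `one_step` should match `beta_iv` up to rounding error whatever starting value is chosen; for nonlinear models this would hold only approximately.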


Tests Based on the IVGNR 

If the restrictions to be tested are all linear restrictions, there is no further
loss of generality if we suppose that they are all zero restrictions. Thus the
null and alternative hypotheses can be written as

$$H_0:\; y = X_1\beta_1 + u, \quad\text{and} \tag{8.51}$$
$$H_1:\; y = X_1\beta_1 + X_2\beta_2 + u, \tag{8.52}$$

where the matrices $X_1$ and $X_2$ are, respectively, $n \times k_1$ and $n \times k_2$, $\beta_1$ is a
$k_1$-vector, and $\beta_2$ is a $k_2$-vector. As elsewhere in this chapter, it is assumed
that $E(uu^\top) = \sigma^2 I$. Any or all of the columns of $X = [X_1 \;\; X_2]$ may be
correlated with the error terms. It is assumed that there exists an $n \times l$
matrix $W$ of instruments, which are asymptotically uncorrelated with the
error terms, and that $l \ge k = k_1 + k_2$.

The same matrix of instruments is assumed to be used for the estimation of
both $H_0$ and $H_1$. While this assumption is natural if we start by estimating
$H_1$ and then impose restrictions on it, it may not be so natural if we start
by estimating $H_0$ and then estimate a less restricted model. A matrix of
instruments that would be entirely appropriate for estimating $H_0$ may be
inappropriate for estimating $H_1$, either because it omits some columns of $X_2$
that are known to be uncorrelated with the errors, or because the number of
instruments is greater than $k_1$ but less than $k_1 + k_2$. It is essential that the
$W$ matrix used should be appropriate for estimating $H_1$ as well as $H_0$.


Exactly the same reasoning as that used in Section 6.7, based on the three
properties of the IVGNR established in the previous subsection, shows that
an asymptotically valid test of $H_0$ against the alternative $H_1$ is
provided by the artificial F statistic obtained from running the following two
IVGNRs, which correspond to $H_0$ and $H_1$, respectively:


$${\rm IVGNR}_0:\; y - X_1\acute\beta_1 = P_W X_1 b_1 + \text{residuals}, \quad\text{and} \tag{8.53}$$
$${\rm IVGNR}_1:\; y - X_1\acute\beta_1 = P_W X_1 b_1 + P_W X_2 b_2 + \text{residuals}. \tag{8.54}$$


As in Section 6.7, it is necessary to evaluate both IVGNRs at the same
parameter values. Since these values must satisfy the null hypothesis,
$\acute\beta_2 = 0$. This is why the regressand, which is the same for both
IVGNRs, does not depend on $X_2$. The artificial F statistic is


$$F = \frac{({\rm SSR}_0 - {\rm SSR}_1)/k_2}{{\rm SSR}_1/(n - k)}, \tag{8.55}$$


where ${\rm SSR}_0$ and ${\rm SSR}_1$ denote the sums of squared residuals from
(8.53) and (8.54), respectively.


Because both $H_0$ and $H_1$ are linear models, the value of $\acute\beta$ used
to evaluate the regressands of (8.53) and (8.54) has no effect on the
difference between the SSRs of the two regressions, which, when divided by
$k_2$, is the numerator of the artificial F statistic. To see this, we need to
write the SSRs from the two IVGNRs as quadratic forms in the vector
$y - X_1\acute\beta_1$ and the projection matrices $M_{P_W X_1}$ and
$M_{P_W X}$, respectively. Thus


$${\rm SSR}_0 - {\rm SSR}_1 = (y - X_1\acute\beta_1)^\top (M_{P_W X_1} - M_{P_W X})(y - X_1\acute\beta_1)$$
$$\qquad = (y - X_1\acute\beta_1)^\top (P_{P_W X} - P_{P_W X_1})(y - X_1\acute\beta_1), \tag{8.56}$$


where $P_{P_W X_1}$ and $P_{P_W X}$ project orthogonally on to
$\mathcal{S}(P_W X_1)$ and $\mathcal{S}(P_W X)$, respectively, and
$M_{P_W X_1}$ and $M_{P_W X}$ are the complementary projections. In
Exercise 8.17, readers are asked to show that expression (8.56) is equal to the
much simpler expression


$$y^\top (P_{P_W X} - P_{P_W X_1})\, y, \tag{8.57}$$


which does not depend in any way on $\acute\beta$.
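
As an informal numerical check of this invariance, the sketch below (on
simulated data, with all names and parameter values chosen purely for
illustration) computes ${\rm SSR}_0 - {\rm SSR}_1$ from the two IVGNRs at two
different values of $\acute\beta_1$ and compares both with the direct
expression (8.57). It then forms the artificial F statistic (8.55) at the
restricted IV estimates.

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix on to the column span of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def ssr(regressand, regressors):
    """Sum of squared residuals from an OLS regression."""
    coef = np.linalg.lstsq(regressors, regressand, rcond=None)[0]
    resid = regressand - regressors @ coef
    return resid @ resid

rng = np.random.default_rng(1)
n, l, k1, k2 = 150, 5, 2, 1
k = k1 + k2

# Simulated data satisfying H0 (illustrative assumptions throughout).
W = np.column_stack([np.ones(n), rng.standard_normal((n, l - 1))])
v = rng.standard_normal((n, 2))
u = 0.7 * v[:, 0] + rng.standard_normal(n)
x1 = W @ rng.standard_normal(l) + v[:, 0]    # endogenous column of X1
x2 = W @ rng.standard_normal(l) + v[:, 1]    # the single column of X2
X1 = np.column_stack([np.ones(n), x1])
X = np.column_stack([X1, x2])
y = X1 @ np.array([1.0, 0.5]) + u

Pw = proj(W)
PwX1, PwX = Pw @ X1, Pw @ X

def ssr_difference(beta1_acute):
    """SSR0 - SSR1 from IVGNRs (8.53) and (8.54) evaluated at beta1_acute."""
    r = y - X1 @ beta1_acute
    return ssr(r, PwX1) - ssr(r, PwX)

direct = y @ (proj(PwX) - proj(PwX1)) @ y          # expression (8.57)
diff_a = ssr_difference(np.zeros(k1))
diff_b = ssr_difference(np.array([5.0, -2.0]))

# Artificial F statistic (8.55), evaluated at the restricted IV estimates.
beta1_tilde = np.linalg.solve(PwX1.T @ X1, PwX1.T @ y)
ssr1 = ssr(y - X1 @ beta1_tilde, PwX)
F = (direct / k2) / (ssr1 / (n - k))
print(direct, diff_a, diff_b, F)
```

Only the numerator is invariant to the evaluation point; the denominator
`ssr1` is computed at the restricted IV estimates because it must estimate the
error variance consistently.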


Copyright © 1999, Russell Davidson and James G. MacKinnon 




It is important to note that, although the difference between the SSRs of (8.53)
and (8.54) does not depend on $\acute\beta$, the same is not true of the
individual SSRs. Thus, if different values of $\acute\beta$ were used for
(8.53) and (8.54), we would get a wrong answer. Similarly, it is essential that
the same instrument matrix $W$ should be used in both regressions, since
otherwise none of the above analysis would go through. It is also essential
that $\acute\beta$ be a consistent estimator under the null hypothesis.
Otherwise, the denominator of the test statistic (8.55) will not estimate
$\sigma^2$ consistently, and (8.55) will not follow the $F(k_2, n - k)$
distribution asymptotically. If (8.53) and (8.54) are correctly formulated,
with the same $\acute\beta$ and the same instrument matrix $W$, it can be shown
that $k_2$ times the artificial F statistic (8.55) is equal to the Wald
statistic (8.48) with $\beta_{20} = 0$, except for the estimate of the error
variance in the denominator; see Exercise 8.18.


Although the theory presented in Section 6.7 is enough to justify the test
based on the IVGNR that we have developed above, it is instructive to check
that $k_2$ times the F statistic is indeed asymptotically distributed as
$\chi^2(k_2)$ under the null hypothesis $H_0$. Because the numerator expression
(8.56) does not depend on $\acute\beta$, it is perfectly valid to evaluate it
with $\acute\beta$ equal to the true parameter vector $\beta_0$. Since
$y - X\beta_0$ is equal to $u$, the vector of error terms, expression (8.56)
becomes


$$u^\top (P_{P_W X} - P_{P_W X_1})\, u. \tag{8.58}$$


This is a quadratic form in the vector $u$ and the difference of two projection
matrices, one of which projects on to a subspace of the image of the other.
Using the result of Exercise 2.16, we see that the difference is itself an
orthogonal projection matrix, projecting on to a space of dimension
$k - k_1 = k_2$. If the vector $u$ were assumed to be normally distributed, and
$X$ and $W$ were fixed, we could use Theorem 4.1 to show that $1/\sigma^2$
times (8.58) is distributed as $\chi^2(k_2)$. In Exercise 8.19, readers are
invited to show that, when the error terms are asymptotically uncorrelated with
the instruments, (8.58) is asymptotically distributed as $\sigma^2$ times a
variable that follows the $\chi^2(k_2)$ distribution. Since the denominator of
the F statistic (8.55) is a consistent estimator of $\sigma^2$, we see that
$k_2$ times the F statistic is indeed asymptotically distributed as
$\chi^2(k_2)$.


Tests Based on Criterion Functions 


It may appear strange to advocate using the IVGNR to compute an artificial
F statistic when one can more easily compute a real F statistic from the
SSRs obtained by IV estimation of (8.51) and (8.52). However, such a “real”
F statistic is not valid, even asymptotically. This can be seen by evaluating
the IVGNRs (8.53) and (8.54) at the restricted estimates $\tilde\beta$, where
$\tilde\beta$ is a $k$-vector with the first $k_1$ components equal to the IV
estimates $\tilde\beta_1$ from (8.51) and the last $k_2$ components zero. The
residuals from the IVGNR (8.53) are then exactly the same as those from IV
estimation of (8.51). For (8.54), we can use the result of Exercise 8.16 to
see that the residuals can be written as


$$y - X\hat\beta + M_W X(\hat\beta - \tilde\beta), \tag{8.59}$$


where $\hat\beta$ is the unrestricted IV estimator for (8.52). If all the
regressors could serve as their own instruments, we would have $M_W X = O$,
and the last term in expression (8.59) would vanish, leaving just
$y - X\hat\beta$, the residuals from (8.52). But, when some of the regressors
are not used as instruments, the two vectors of residuals are not the same.
The analysis of the previous subsection shows clearly that the correct
residuals to use for testing purposes are the ones from the two IVGNRs.


The heart of the problem is that IV estimates are not obtained by minimizing
the SSR, but rather the IV criterion function (8.30). The proper IV analog
for the F statistic is a statistic based on the difference between the values of
this criterion function evaluated at the restricted and unrestricted estimates.
At the unrestricted estimates $\hat\beta$, we obtain


$$Q(\hat\beta, y) = (y - X\hat\beta)^\top P_W (y - X\hat\beta). \tag{8.60}$$


Using the explicit expression (8.29) for the IV estimator, we see that (8.60) is
equal to


$$y^\top \bigl(I - P_W X (X^\top P_W X)^{-1} X^\top\bigr) P_W \bigl(I - X (X^\top P_W X)^{-1} X^\top P_W\bigr) y$$
$$\qquad = y^\top \bigl(P_W - P_W X (X^\top P_W X)^{-1} X^\top P_W\bigr) y \tag{8.61}$$
$$\qquad = y^\top (P_W - P_{P_W X})\, y.$$


If $Q$ is now evaluated at the restricted estimates $\tilde\beta$, an exactly
similar calculation shows that


$$Q(\tilde\beta, y) = y^\top (P_W - P_{P_W X_1})\, y. \tag{8.62}$$


The difference between (8.62) and (8.61) is thus


$$Q(\tilde\beta, y) - Q(\hat\beta, y) = y^\top (P_{P_W X} - P_{P_W X_1})\, y. \tag{8.63}$$


This is precisely the difference (8.57) between the SSRs of the two IVGNRs
(8.53) and (8.54). Thus we can obtain an asymptotically correct test statistic
by dividing (8.63) by any consistent estimate of the error variance $\sigma^2$.


The only practical difficulty in computing (8.63) is that some regression pack- 
ages do not report the minimized value of the IV criterion function. However, 
this value is very easy to compute, since for any IV regression, restricted or 
unrestricted, it is equal to the explained sum of squares from a regression 
of the vector of IV residuals on the instruments W, as can be seen at once 
from (8.60). 
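
The computation just described can be sketched as follows. This is an
illustrative example on simulated data, not code from the text; the helper
names `iv_fit` and `iv_criterion` and all parameter values are assumptions.
The criterion function is obtained as the explained sum of squares from
regressing the IV residuals on $W$, and the difference between its restricted
and unrestricted values is compared with the right-hand side of (8.63).

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix on to the column span of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def iv_fit(y, X, W):
    """Generalized IV estimates and the corresponding IV residuals."""
    PwX = proj(W) @ X
    beta = np.linalg.solve(PwX.T @ X, PwX.T @ y)
    return beta, y - X @ beta

def iv_criterion(resid, W):
    """Explained sum of squares from regressing resid on W (resid' P_W resid)."""
    fitted = W @ np.linalg.lstsq(W, resid, rcond=None)[0]
    return fitted @ fitted

rng = np.random.default_rng(2)
n, l = 200, 5
W = np.column_stack([np.ones(n), rng.standard_normal((n, l - 1))])
v = rng.standard_normal(n)
u = 0.7 * v + rng.standard_normal(n)
x1 = W @ rng.standard_normal(l) + v           # endogenous regressor in X1
X1 = np.column_stack([np.ones(n), x1])
X2 = W[:, [1]]                                # exogenous regressor tested under H1
X = np.column_stack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + u             # data generated under H0

beta_tilde, u_tilde = iv_fit(y, X1, W)        # restricted IV estimation
beta_hat, u_hat = iv_fit(y, X, W)             # unrestricted IV estimation

Q_diff = iv_criterion(u_tilde, W) - iv_criterion(u_hat, W)

Pw = proj(W)
direct = y @ (proj(Pw @ X) - proj(Pw @ X1)) @ y   # right-hand side of (8.63)

sigma2_hat = u_hat @ u_hat / n                # one consistent variance estimate
statistic = Q_diff / sigma2_hat               # asymptotically chi-squared(k2) under H0
print(Q_diff, direct, statistic)
```

Any other consistent estimate of the error variance could be used in place of
`sigma2_hat`; the choice made here is only one possibility.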


Heteroskedasticity-Robust Tests


The test statistics discussed so far are valid only under the assumptions that
the error terms are serially uncorrelated and homoskedastic. The second of
these assumptions can be relaxed if we are prepared to use an HCCME. If
$E(uu^\top) = \Omega$, where $\Omega$ is a diagonal $n \times n$ matrix, then it
can be readily seen from (8.32) that the asymptotic covariance matrix of
$n^{1/2}(\hat\beta_{\rm IV} - \beta_0)$ is


$$\Bigl(\operatorname*{plim}_{n\to\infty} \tfrac{1}{n} X^\top P_W X\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \tfrac{1}{n} X^\top P_W \Omega P_W X\Bigr) \Bigl(\operatorname*{plim}_{n\to\infty} \tfrac{1}{n} X^\top P_W X\Bigr)^{-1}. \tag{8.64}$$


Not surprisingly, this looks very much like expression (5.33) for OLS
estimation, except that $P_W X$ replaces $X$, and (8.64) involves probability
limits rather than ordinary limits because the matrices $X$, and possibly also
$W$, are now assumed to be stochastic.


It is not difficult to estimate the asymptotic covariance matrix (8.64). The
outside factors can be estimated consistently in the obvious way, and the
middle factor can be estimated consistently by using the matrix


$$\tfrac{1}{n} X^\top P_W \hat\Omega P_W X,$$


where $\hat\Omega$ is an $n \times n$ diagonal matrix, the $t$th diagonal
element of which is equal to $\hat u_t^2$, the square of the $t$th IV residual.
In practice, since the factors of $n$ are needed only for asymptotic analysis,
we will use the matrix


$$\widehat{\rm Var}_{\rm h}(\hat\beta_{\rm IV}) = (X^\top P_W X)^{-1} X^\top P_W \hat\Omega P_W X (X^\top P_W X)^{-1} \tag{8.65}$$


to estimate the covariance matrix of $\hat\beta_{\rm IV}$. This covariance
matrix estimator has exactly the same form as the HCCME (5.39) for the OLS
case. The only difference is that $P_W X$ replaces $X$.
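
A minimal sketch of (8.65) in code is given below. It is not taken from the
text; the function name `iv_hccme`, the simulated heteroskedastic design, and
the hypothesized value used in the t statistic are all assumptions chosen for
illustration. The diagonal matrix $\hat\Omega$ is never formed explicitly;
instead, each row of $P_W X$ is weighted by the corresponding squared residual.

```python
import numpy as np

def iv_hccme(y, X, W):
    """IV estimates, IV residuals, and the sandwich estimator (8.65)."""
    PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]       # P_W X
    beta = np.linalg.solve(PwX.T @ X, PwX.T @ y)
    resid = y - X @ beta
    bread = np.linalg.inv(PwX.T @ PwX)                   # (X' P_W X)^{-1}
    meat = (PwX * (resid ** 2)[:, None]).T @ PwX         # X' P_W Omega_hat P_W X
    return beta, resid, bread @ meat @ bread

rng = np.random.default_rng(6)
n = 250
W = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
v = rng.standard_normal(n)
scale = np.exp(0.5 * W[:, 1])                            # heteroskedasticity
u = scale * (0.6 * v + rng.standard_normal(n))
x = W @ np.array([0.0, 1.0, 1.0, -1.0]) + v              # endogenous regressor
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u

beta_iv, u_hat, cov_h = iv_hccme(y, X, W)
robust_se = np.sqrt(np.diag(cov_h))

# Heteroskedasticity-robust t statistic for the hypothesis that the slope is 2.
t_robust = (beta_iv[1] - 2.0) / robust_se[1]
print(beta_iv, robust_se, t_robust)
```

Weighting rows rather than building an $n \times n$ matrix keeps memory use
linear in $n$, which matters only for large samples.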


Once (8.65) has been calculated, we can compute Wald tests that are robust 
to heteroskedasticity of unknown form. We simply use (8.47) for a test of a 
single linear restriction, or (8.48) for a test of two or more restrictions, with 
(8.65) replacing the ordinary covariance matrix estimator. Alternatively, we 
can use the IV variant of the HRGNR introduced in Section 6.8. To obtain 
this variant, all we need do is to use PwX in place of X in (6.90); see 
Exercise 8.20. Of course, it must be remembered that all these tests are 
based on asymptotic theory, and there is good reason to believe that this 
theory may often provide a poor guide to their performance in finite samples. 


8.6 Testing Overidentifying Restrictions


The degree of overidentification of an overidentified linear regression model
is defined to be $l - k$, where, as usual, $l$ is the number of instruments, and
$k$ is the number of regressors. Such a model implicitly incorporates $l - k$
overidentifying restrictions. These arise because the generalized IV estimator
implicitly uses only $k$ effective instruments, namely, the $k$ columns of
$P_W X$. It does this because it is not possible, in general, to solve the $l$
moment conditions (8.11) for only $k$ unknowns.


In order for a set of instruments to be valid, a sufficient condition is (8.13),
according to which the error term $u_t$ has mean zero conditional on $W_t$, the
$l$-vector of current instruments. When this condition is not satisfied, the
IV estimator risks being inconsistent. But, if we use for estimation only the
$k$ effective instruments in the matrix $P_W X$, it is only those $k$
instruments that need to satisfy condition (8.13). Let $W^*$ be an
$n \times (l - k)$ matrix of extra instruments such that
$\mathcal{S}(W) = \mathcal{S}(P_W X, W^*)$. This means that the $l$-dimensional
span of the full set of instruments is generated by linear combinations of the
effective instruments, $P_W X$, and the extra instruments, $W^*$. The
overidentifying restrictions require that the extra instruments should also
satisfy (8.13). Unlike the conditions for the effective instruments, the
overidentifying restrictions can, and always should, be tested.


The matrix $W^*$ is not uniquely determined, but we will see in a moment that
this does not matter. For any specific choice of $W^*$, what we wish to test is
the set of conditions


$$E(W_t^{*\top} u_t) = 0. \tag{8.66}$$


Although we do not observe the $u_t$, we can estimate the vector $u$ by the
vector of IV residuals $\hat u$. Thus, in order to make our test operational,
we form the sample analog of condition (8.66), which is


$$\frac{1}{n}\,(W^*)^\top \hat u, \tag{8.67}$$


and check whether this quantity is significantly different from zero.


The model we wish to test is


$$y = X\beta + u, \quad u \sim {\rm IID}(0, \sigma^2 I), \quad E(W^\top u) = 0. \tag{8.68}$$


Testing the overidentifying restrictions implicit in this model is equivalent to
testing it against the alternative model


$$y = X\beta + W^*\gamma + u, \quad u \sim {\rm IID}(0, \sigma^2 I), \quad E(W^\top u) = 0. \tag{8.69}$$


This alternative model is constructed in such a way that it is just identified:
There are precisely $l$ coefficients to estimate, the $k$ elements of $\beta$
and the $l - k$ elements of $\gamma$, and there are precisely $l$ instruments.

To see why testing (8.68) against (8.69) also tests whether the quantity (8.67)
is significantly different from zero, consider the numerator of the artificial
IVGNR F test for (8.68) against (8.69). Under the null hypothesis, the generic
form of this numerator is given by (8.58). For present purposes, the matrix
$X_1$ of regressors in the restricted regression becomes $X$, and the matrix
$X$ in (8.58) is replaced by $[X \;\; W^*]$, the regressor matrix for (8.69).
Since $P_W [X \;\; W^*] = [P_W X \;\; W^*]$, and the span of the columns of
this matrix is just $\mathcal{S}(W)$, it follows that the first of the two
projection matrices in (8.58) becomes simply $P_W$. The second projection
matrix is $P_{P_W X}$. One possible choice for $W^*$ would be a matrix the
columns of which were all orthogonal to those of $P_W X$. Such a matrix could
be constructed from an arbitrary $W^*$ by multiplying it by $M_{P_W X}$. With
such a choice, the orthogonality of $P_W X$ and $W^*$ means that, by the result
in Exercise 2.16,


$$P_W - P_{P_W X} = P_{W^*}.$$


The numerator of the F statistic is thus just


$$\hat u^\top P_{W^*} \hat u = \hat u^\top W^* \bigl((W^*)^\top W^*\bigr)^{-1} (W^*)^\top \hat u.$$


Since the middle matrix on the right-hand side of this equation is positive
definite by construction, it can be seen that the F test is testing whether
(8.67) is significantly different from zero.


As we claimed above, implementing a test of the overidentifying restrictions
does not require a specific choice of $W^*$, and in fact it does not require us
to construct $W^*$ explicitly at all. To see why, consider the two IVGNRs for
the test, evaluated at $\hat\beta_{\rm IV}$. They are


$$\hat u = P_W X b_1 + \text{residuals}, \quad\text{and} \tag{8.70}$$
$$\hat u = P_W X b_1 + W^* b_2 + \text{residuals}. \tag{8.71}$$


The numerator of the F statistic is the difference of the two SSRs, which
is equal to minus the difference of the two explained sums of squares. The
explained sum of squares from (8.70) is zero, because the regressand is
orthogonal to the regressors. The explained sum of squares from (8.71) is the
same as that from the regression


$$\hat u = Wb + \text{residuals}, \tag{8.72}$$


because, however $W^*$ is chosen, we always have
$\mathcal{S}(P_W X, W^*) = \mathcal{S}(W)$. The test statistic is therefore
equal to the explained sum of squares from (8.72) divided by a consistent
estimate of the error variance. One such estimate is
$n^{-1}\hat u^\top \hat u$. Thus one way to compute the test statistic is to
regress the residuals $\hat u$ from IV estimation of the original model (8.68)
on the full set of instruments, and use $n$ times the uncentered $R^2$ from
this regression as the test statistic. If (8.68) is correctly specified, the
asymptotic distribution of the statistic is $\chi^2(l - k)$.


Another very easy way to test the overidentifying restrictions is to use a test
statistic based on the IV criterion function. Since the alternative model (8.69)
is just identified, the minimized IV criterion function for it is exactly zero.
To see this, note that, for any just identified model, the IV residuals are
orthogonal to the full set of instruments by the moment conditions (8.11) used
with just identified models. Therefore, when the criterion function (8.30) is
evaluated at the IV estimates $\hat\beta_{\rm IV}$, it becomes
$\hat u^\top P_W \hat u$, which is zero because of the orthogonality of $W$ and
$\hat u$. Thus an appropriate test statistic is just the criterion function
$Q(\hat\beta_{\rm IV}, y)$ for the original model (8.68), divided by the
estimate of the error variance from this same model. A test based on this
statistic is often called a Sargan test, after Sargan (1958). The test statistic
is numerically identical to the one based on (8.72), as readers are asked to
show in Exercise 8.21.
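
Both ways of computing the overidentification statistic can be sketched in a
few lines. The example below uses simulated data with valid instruments; the
design, the variable names, and the choice $l - k = 2$ are assumptions made for
illustration. With two degrees of freedom, the $\chi^2$ survival function has
the closed form $\exp(-x/2)$, which is used here so that no statistics library
is needed; for other degrees of freedom a $\chi^2$ routine would be required.

```python
import numpy as np

rng = np.random.default_rng(3)
n, l, k = 300, 4, 2
df = l - k                                     # degree of overidentification

# Simulated data with valid instruments (illustrative assumptions).
W = np.column_stack([np.ones(n), rng.standard_normal((n, l - 1))])
v = rng.standard_normal(n)
u = 0.8 * v + rng.standard_normal(n)
x = W @ np.array([0.5, 1.0, -1.0, 0.5]) + v    # endogenous regressor
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u

# IV estimation of the original model (8.68).
PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]
beta_iv = np.linalg.solve(PwX.T @ X, PwX.T @ y)
u_hat = y - X @ beta_iv

# Version 1: n times the uncentered R^2 from regression (8.72).
fitted = W @ np.linalg.lstsq(W, u_hat, rcond=None)[0]
sargan_nr2 = n * (fitted @ fitted) / (u_hat @ u_hat)

# Version 2: minimized IV criterion function over the variance estimate.
Pw = W @ np.linalg.solve(W.T @ W, W.T)
sigma2_hat = u_hat @ u_hat / n
sargan_q = (u_hat @ Pw @ u_hat) / sigma2_hat

# P value from chi-squared(2), whose survival function is exp(-x/2).
p_value = np.exp(-sargan_nr2 / 2.0)
print(sargan_nr2, sargan_q, p_value)
```

The two statistics coincide because the explained sum of squares from (8.72)
is exactly $\hat u^\top P_W \hat u$.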


Although (8.69) is a simple enough model, it actually represents two con- 
ceptually different alternatives, because there are two situations in which the 
“true” parameter vector $\gamma$ in (8.69) could be nonzero. One possibility is
that the model (8.68) is correctly specified, but some of the instruments are 
asymptotically correlated with the error terms and are therefore not valid 
instruments. The other possibility is that (8.68) is not correctly specified, 
and some of the instruments (or, possibly, other variables that are correlated 
with them) have incorrectly been omitted from the regression function. In 
either case, the overidentification test statistic will lead us to reject the null 
hypothesis whenever the sample size is large enough. 


Even if we do not know quite how to interpret a significant value of the over- 
identification test statistic, it is always a good idea to compute it. If it is 
significantly larger than it should be by chance under the null hypothesis, 
one should be extremely cautious in interpreting the estimates, because it is 
quite likely either that the model is specified incorrectly or that some of the 
instruments are invalid. 


8.7 Durbin-Wu-Hausman Tests 


In many cases, we do not know whether we actually need to use instrumental 
variables. For example, we may suspect that some variables are measured 
with error, but we may not know whether the errors are large enough to 
cause enough inconsistency for us to worry about. Or we may suspect that 
certain explanatory variables are endogenous, but we may not be at all sure of 
our suspicions, and we may not know how much inconsistency would result if 
they were justified. In such a case, it may or may not be perfectly reasonable 
to employ OLS estimation. 

If the regressors are valid instruments, then, as we saw in Section 8.3, they
are also the optimal instruments. Consequently, the OLS estimator, which 
is consistent in this case, is preferable to an IV estimator computed with 
some other valid instrument matrix W. In view of this, it would evidently 
be very useful to be able to test the null hypothesis that the error terms 
are uncorrelated with all the regressors against the alternative that they are 
correlated with some of the regressors, although not with the instruments W. 
In this section, we discuss a simple procedure that can be used to perform 
such a test. This procedure dates back to a famous paper by Durbin (1954), 
and it was subsequently extended by Wu (1973) and Hausman (1978). We 
will therefore refer to all tests of this general type as Durbin-Wu-Hausman 
tests, or DWH tests. 


The null and alternative hypotheses for the DWH test can be expressed as


$$H_0:\; y = X\beta + u, \quad u \sim {\rm IID}(0, \sigma^2 I), \quad E(X^\top u) = 0, \quad\text{and} \tag{8.73}$$
$$H_1:\; y = X\beta + u, \quad u \sim {\rm IID}(0, \sigma^2 I), \quad E(W^\top u) = 0. \tag{8.74}$$


Under $H_1$, the IV estimator $\hat\beta_{\rm IV}$ is consistent, but the OLS
estimator $\hat\beta_{\rm OLS}$ is not. Under $H_0$, both are consistent. Thus,
plim $(\hat\beta_{\rm IV} - \hat\beta_{\rm OLS})$ is zero under the null and
nonzero under the alternative. The idea of the DWH test is to check whether the
difference $\hat\beta_{\rm IV} - \hat\beta_{\rm OLS}$ is significantly
different from zero in the available sample. This difference, which is
sometimes called the vector of contrasts, can be written as


$$\hat\beta_{\rm IV} - \hat\beta_{\rm OLS} = (X^\top P_W X)^{-1} X^\top P_W y - (X^\top X)^{-1} X^\top y. \tag{8.75}$$


Expression (8.75) is not very useful as it stands, but it can be converted into
a much more useful expression by means of a trick that is often useful in
econometrics. We pretend that the first factor of $\hat\beta_{\rm IV}$ is
common to both estimators, and take it out as a common factor. This gives


$$\hat\beta_{\rm IV} - \hat\beta_{\rm OLS} = (X^\top P_W X)^{-1} \bigl(X^\top P_W y - X^\top P_W X (X^\top X)^{-1} X^\top y\bigr).$$


Now we can find some genuinely common factors in the two terms of the
rightmost factor of this expression. Taking them out yields


$$\hat\beta_{\rm IV} - \hat\beta_{\rm OLS} = (X^\top P_W X)^{-1} X^\top P_W \bigl(I - X (X^\top X)^{-1} X^\top\bigr) y$$
$$\qquad = (X^\top P_W X)^{-1} X^\top P_W M_X y. \tag{8.76}$$


The first factor in expression (8.76) is a positive definite matrix, by the
identification condition. Therefore, testing whether
$\hat\beta_{\rm IV} - \hat\beta_{\rm OLS}$ is significantly different from zero
is equivalent to testing whether the vector $X^\top P_W M_X y$ is
significantly different from zero.
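
The identity (8.76) is easy to confirm numerically. The short sketch below,
on simulated data with illustrative names and parameter values that are not
from the text, computes the vector of contrasts directly and again from the
OLS residuals $M_X y$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Simulated data with one endogenous regressor (illustrative assumptions).
W = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
v = rng.standard_normal(n)
u = 0.8 * v + rng.standard_normal(n)
x = W @ np.array([0.5, 1.0, -1.0, 0.5]) + v
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
Mx_y = y - X @ beta_ols                          # OLS residuals, M_X y

PwX = W @ np.linalg.lstsq(W, X, rcond=None)[0]   # P_W X
beta_iv = np.linalg.solve(PwX.T @ X, PwX.T @ y)

contrast = beta_iv - beta_ols                          # left-hand side of (8.76)
via_876 = np.linalg.solve(PwX.T @ X, PwX.T @ Mx_y)     # right-hand side of (8.76)
print(contrast, via_876)
```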

Under $H_0$, the preferred estimation technique is OLS, and the OLS residuals
are given by the vector $M_X y$. Therefore, we wish to test whether the
$k$ columns of the matrix $P_W X$ are orthogonal to this vector of residuals.
Let us partition the matrix of regressors $X$ as in (8.38), so that
$X = [Z \;\; Y]$, where the $k_1$ columns of $Z$ are included in the matrix of
instruments $W$, and the $k_2 = k - k_1$ columns of $Y$ are treated as
potentially endogenous. By construction, OLS residuals are orthogonal to all
the columns of $X$, in particular to those of $Z$. For these regressors, there
is therefore nothing to test: The relation


$$Z^\top P_W M_X y = Z^\top M_X y = 0$$


holds identically, because $P_W Z = Z$ and $M_X Z = O$. The test is thus
concerned only with the $k_2$ elements of $Y^\top P_W M_X y$, which will not in
general be identically zero, but should not differ from it significantly under
$H_0$.


The easiest way to test whether $Y^\top P_W M_X y$ is significantly different
from zero is to use an F test for the $k_2$ restrictions $\delta = 0$ in the
OLS regression


$$y = X\beta + P_W Y \delta + u. \tag{8.77}$$


The OLS estimates of $\delta$ from (8.77) are, by the FWL Theorem, the same as
those from the FWL regression of $M_X y$ on $M_X P_W Y$, that is,


$$\hat\delta = (Y^\top P_W M_X P_W Y)^{-1} Y^\top P_W M_X y.$$


Since the inverted matrix is positive definite, we see that testing whether
$\delta = 0$ is equivalent to testing whether $Y^\top P_W M_X y = 0$, as
desired. This conclusion could have been foreseen by considering the threefold
orthogonal decomposition that is implicitly performed by an F test; recall
Section 4.4. The DWH test can also be implemented by means of another F test,
which yields exactly the same test statistic; see Exercise 8.22 for details.


The F test based on (8.77) has $k_2$ and $n - k - k_2$ degrees of freedom.
Under $H_0$, if we assume that $X$ and $W$ are not merely predetermined but
also exogenous, and that the error terms $u$ are multivariate normal, the F
statistic will indeed have the $F(k_2, n - k - k_2)$ distribution. Under $H_0$
as it is expressed in (8.73), its asymptotic distribution is $F(k_2, \infty)$,
and $k_2$ times the statistic is asymptotically distributed as $\chi^2(k_2)$.
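
The regression-based DWH test can be sketched as follows. The data are
simulated with a genuinely endogenous regressor, so the null should be
rejected; the design and all names (`Z`, `extra`, `ols`, and so on) are
illustrative assumptions rather than anything prescribed by the text. The
sketch also confirms that the OLS estimate of $\delta$ from (8.77) matches the
FWL expression given above.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients and residuals (y may be a vector or a matrix)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

rng = np.random.default_rng(8)
n = 300
Z = np.ones((n, 1))                              # exogenous regressor (constant)
extra = rng.standard_normal((n, 3))              # instruments excluded from X
W = np.column_stack([Z, extra])
v = rng.standard_normal(n)
u = 0.8 * v + rng.standard_normal(n)             # u correlated with v
Y = (extra @ np.array([1.0, -1.0, 0.5]) + v)[:, None]   # endogenous regressor
X = np.column_stack([Z, Y])
y = X @ np.array([1.0, 2.0]) + u

k, k2 = X.shape[1], Y.shape[1]
PwY = W @ np.linalg.lstsq(W, Y, rcond=None)[0]   # P_W Y

# Restricted regression (delta = 0) and unrestricted regression (8.77).
_, e_r = ols(y, X)
coef_u, e_u = ols(y, np.column_stack([X, PwY]))
ssr_r, ssr_u = e_r @ e_r, e_u @ e_u

F = ((ssr_r - ssr_u) / k2) / (ssr_u / (n - k - k2))
delta_hat = coef_u[k:]

# FWL form: (Y' P_W M_X P_W Y)^{-1} Y' P_W M_X y.
Mx_PwY = ols(PwY, X)[1]
delta_fwl = np.linalg.solve(Mx_PwY.T @ PwY, PwY.T @ e_r)
print(F, delta_hat, delta_fwl)
```

With this design the F statistic is far above conventional critical values
for $F(1, n - 3)$, as expected when OLS is inconsistent.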


If the null hypothesis (8.73) is rejected, we are faced with the same sort of 
ambiguity of interpretation as for the test of overidentifying restrictions. One 
possibility is that at least some columns of Y are indeed endogenous, but in 
such a way that the alternative model (8.74) is correctly specified. But we can 
equally well take (8.77) literally as a model with exogenous or predetermined 
regressors. In that case, the nature of the misspecification of (8.73) is not that 
Y is endogenous, but rather that the linear combinations of the instruments 
given by the columns of PwY have explanatory power for the dependent 
variable y over and above that of X. Without further investigation, there is 
no way to choose between these alternative interpretations. 


Tests Based on Vectors of Contrasts


DWH tests are much more widely applicable than we have indicated so far.
They can be used whenever there are two estimators, one of which, like
$\hat\beta_{\rm IV}$, is inefficient but consistent under relatively weak
conditions, while the other, like $\hat\beta_{\rm OLS}$, is efficient, but only
if the stronger conditions required for it to be consistent are satisfied. For
example, for the panel data case discussed in Section 7.10, a DWH test can be
used to see whether it is valid to employ the random-effects estimator rather
than the less efficient fixed-effects estimator; see Hausman (1978) and
Hausman and Taylor (1981) for details.


In this case, and many others, it is convenient to base a test directly on the
vector of contrasts, that is, the difference between the two vectors of
estimates. Suppose we are trying to estimate a $k$-vector $\theta$ of which the
true value is $\theta_0$. Let $\hat\theta_E$ denote an efficient estimator, and
let $\hat\theta_I$ denote an inefficient estimator that is consistent under
weaker conditions. Under mild regularity conditions, an inefficient estimator
is always asymptotically equal to an efficient estimator plus a random vector
that is uncorrelated with the efficient estimator. We saw an example of this in
Section 3.5 when we discussed the Gauss-Markov Theorem; see also Exercise 8.23.
Thus, in a broad range of cases, we can write


$$n^{1/2}(\hat\theta_I - \theta_0) = n^{1/2}(\hat\theta_E - \theta_0) + v, \tag{8.78}$$


where $v$ is a random $k$-vector that is uncorrelated with
$n^{1/2}(\hat\theta_E - \theta_0)$. This vector is asymptotically equal to
$n^{1/2}$ times the vector of contrasts, which is just
$\hat\theta_I - \hat\theta_E$.


In this situation, a DWH test may be based on a quadratic form in the vector
of contrasts and the inverse of an estimate of its covariance matrix. From
(8.78) and the fact that $v$ is uncorrelated with
$n^{1/2}(\hat\theta_E - \theta_0)$, we see that


$${\rm Var}(v) \stackrel{a}{=} {\rm Var}\bigl(n^{1/2}(\hat\theta_I - \theta_0)\bigr) - {\rm Var}\bigl(n^{1/2}(\hat\theta_E - \theta_0)\bigr).$$


Whenever standard asymptotic results apply, the vector
$n^{1/2}(\hat\theta_I - \hat\theta_E)$ will be asymptotically normally
distributed. Therefore, by Theorem 4.1, a suitable test statistic is


$$(\hat\theta_I - \hat\theta_E)^\top \bigl(\widehat{\rm Var}(\hat\theta_I) - \widehat{\rm Var}(\hat\theta_E)\bigr)^{-1} (\hat\theta_I - \hat\theta_E), \tag{8.79}$$


where $\widehat{\rm Var}(\hat\theta_I)$ and $\widehat{\rm Var}(\hat\theta_E)$
are consistent estimates of the covariance matrices of the two estimators.
Tests based on quadratic forms like (8.79) are often called Hausman tests.
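
A small sketch of how a statistic like (8.79) might be computed is given
below. It is not the text's procedure: the function name `hausman_statistic`,
the tolerance, and the handling of a singular difference matrix by a
generalized inverse built from its eigendecomposition are all assumptions made
for illustration. The function reports the numerical rank it used, which would
serve as the degrees of freedom under that assumption, and it refuses to
proceed if the difference matrix has a clearly negative eigenvalue.

```python
import numpy as np

def hausman_statistic(theta_i, theta_e, var_i, var_e, tol=1e-10):
    """Quadratic form in the contrasts, using a generalized inverse.

    Returns the statistic and the numerical rank of var_i - var_e.
    The generalized inverse is an assumption of this sketch, not a
    prescription taken from the text.
    """
    d = np.asarray(theta_i, dtype=float) - np.asarray(theta_e, dtype=float)
    V = np.asarray(var_i, dtype=float) - np.asarray(var_e, dtype=float)
    V = (V + V.T) / 2.0                           # enforce exact symmetry
    eigval, eigvec = np.linalg.eigh(V)
    largest = max(np.abs(eigval).max(), 1e-300)
    if eigval.min() < -tol * largest:
        raise ValueError("difference of covariance matrices is not "
                         "positive semidefinite")
    keep = eigval > tol * largest                 # directions with positive variance
    z = eigvec[:, keep].T @ d                     # contrasts in those directions
    stat = float(np.sum(z ** 2 / eigval[keep]))
    return stat, int(keep.sum())

# Tiny deterministic illustration: only the first contrast has positive variance.
theta_i = np.array([1.0, 2.0])
theta_e = np.array([0.5, 2.0])
var_i = np.diag([0.50, 0.30])
var_e = np.diag([0.25, 0.30])
stat, rank = hausman_statistic(theta_i, theta_e, var_i, var_e)
print(stat, rank)   # → 1.0 1
```

In the illustration the difference matrix is diag(0.25, 0), so the statistic
is $0.5^2 / 0.25 = 1$ on one degree of freedom.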


A problem arises as to the degrees of freedom for the test statistic (8.79).
As we have already seen, the DWH test based on regression (8.77) has $k_2$
degrees of freedom, where $k_2$ is the number of possibly endogenous variables
on the right-hand side of equation (8.73). This is smaller than $k$, the
dimension of the vector $\beta$. A similar phenomenon occurs whenever the
covariance matrix ${\rm Var}(v)$ does not have full rank. It may be hard to
check for such a phenomenon, since the difference between the estimates
$\widehat{\rm Var}(\hat\theta_I)$ and $\widehat{\rm Var}(\hat\theta_E)$ usually
has full rank even if ${\rm Var}(v)$ does not. Worse, this difference is not
guaranteed to be a positive definite matrix; when it is not, the statistic
(8.79) cannot be used without modification.


In some such cases, a test statistic can be based on a subvector of the vector
of contrasts. This is what would have to be done if $\hat\theta_I$ were an IV
estimator and $\hat\theta_E$ were an OLS estimator. Then a DWH statistic of the
form (8.79) would have to be based solely on the coefficients of the possibly
endogenous variables. This would yield a Hausman test asymptotically equivalent
to the F test based on regression (8.77) that we have already discussed.


8.8 Bootstrap Tests 


The difficulty with using the bootstrap for models estimated by IV is that 
there is more than one endogenous variable. The bootstrap DGP must there- 
fore be formulated in such a way as to generate samples containing bootstrap 
realizations of both the main dependent variable y and the endogenous ex- 
planatory variables, which we denote by Y in the notation of (8.38). 


As we saw in Section 8.4, the single equation (8.38) is not a complete
specification of a model. We can complete it in various ways, of which the
easiest is to use equations (8.39) for $i = 1, \ldots, k_2$. This introduces
$k_2$ vectors $\pi_i$, each containing $l$ parameters. In addition, we must
specify the joint distribution of the error terms $u$ in the equation for $y$
and the $v_i$ in the equations for $Y$. If we use the notation of (8.44), we
can write the reduced form equations for the endogenous explanatory variables
in matrix form as


$$Y = W\Pi_2 + V_2, \tag{8.80}$$


where $\Pi_2$ is an $l \times k_2$ matrix, the columns of which are the $\pi_i$
of (8.39), and $V_2$ is an $n \times k_2$ matrix of error terms, the columns of
which are the $v_i$ of (8.39). It is convenient to group all the error terms
together into one matrix, and so we define the $n \times (k_2 + 1)$ matrix $V$
as $[u \;\; V_2]$. Note that this matrix $V$ is not the same as the one used in
Section 8.4. If $V_t$ denotes a typical row of $V$, then we will assume that


$$E(V_t^\top V_t) = \Sigma, \tag{8.81}$$


where $\Sigma$ is a $(k_2 + 1) \times (k_2 + 1)$ covariance matrix, the upper
left-hand element of which is $\sigma^2$, the variance of the error terms in
$u$. Together, (8.38), (8.80), and (8.81) constitute a model that, although not
quite fully specified (because the distribution of the error terms is not
stated), can serve as a basis for various bootstrap procedures.


Suppose that we wish to develop bootstrap versions of the tests considered
in Section 8.5, where the null and alternative hypotheses are given by (8.51)
and (8.52), respectively. For concreteness, we consider the test implemented
by use of the IVGNRs (8.53) and (8.54), although the same principles apply to
other forms of test, such as the asymptotic t and Wald tests (8.47) and (8.48),
or tests based on the IV criterion function. Note that we now have two
different partitions of the matrix $X$ of explanatory variables. First, there
is the partition $X = [Z \;\; Y]$, in which $Z$ contains the exogenous or
predetermined variables, and $Y$ contains the endogenous ones that are modeled
explicitly by (8.80). Then there is the partition $X = [X_1 \;\; X_2]$, in
which we separate the variables $X_1$ included under the null from the
variables $X_2$ that appear only under the alternative. In general, these two
partitions are not related. We can expect that, in most cases, some columns of
$Y$ are contained in $X_1$ and some in $X_2$, and similarly for $Z$.


The first step, as usual, is the estimation by IV of the model (8.51) that
represents the null hypothesis. From this we obtain the constrained parameter
estimates $\tilde\beta_1$ and residuals $\tilde u$. Next, we formulate and run
the two IVGNRs (8.53) and (8.54), evaluated at
$\acute\beta_1 = \tilde\beta_1$, and compute the F statistic. Then, in order to
estimate all the other parameters of the extended model, we run the $k_2$
reduced form regressions represented by (8.80), obtaining OLS estimates and
residuals that we denote respectively by $\hat\Pi_2$ and $\hat V_2$. We will
write $\hat V$ to denote $[\tilde u \;\; \hat V_2]$.

For the bootstrap DGP, suppose first that all the instruments are exogenous.
In that case, they are used unchanged in the bootstrap DGP. At this point,
we must choose between a parametric and a semiparametric bootstrap. Since
the latter is slightly easier, we discuss it first. In most cases, $X$ and $W$
will include a constant, and the residuals $\tilde u$ and $\hat V_2$ will be
centered. If not, as we discussed in Section 4.6, they must be centered before
proceeding further. Because we wish the bootstrap DGP to retain the
contemporaneous covariance structure of $V$, the bootstrap error terms will be
drawn as complete rows $V_t^*$ by resampling entire rows of $\hat V$. In this
way, we draw our bootstrap error terms from the joint empirical distribution of
the $\hat V_t$. With models estimated by least squares, it is desirable to
rescale residuals before they are resampled; again see Section 4.6. Since the
columns of $\hat V_2$ are least squares residuals, it is probably desirable to
rescale them. However, there is no justification for rescaling the vector
$\tilde u$.


For the parametric bootstrap, we must actually estimate $\Sigma$. The easiest way
to do so is to form the matrix

$$\hat\Sigma = \frac{1}{n}\,\hat V^\top \hat V.$$

Since $\tilde\beta_1$ and $\hat\Pi_2$ are consistent estimators, it follows that $\hat V$ is also consistent
for $V$. We can then apply a law of large numbers to each element of $\hat\Sigma$ in
order to show that it converges as $n \to \infty$ to the corresponding element of
the true $\Sigma$. The row vectors of parametric bootstrap error terms $V_t^*$ will
then be independent drawings from the multivariate normal distribution with
mean zero and covariance matrix $\hat\Sigma$. In order to make these drawings, the
easiest method is to form a $(k_2+1) \times (k_2+1)$ matrix $\hat A$ such that $\hat A\hat A^\top = \hat\Sigma$.
Usually, $\hat A$ is chosen to be upper or lower triangular; recall the discussion of
the multivariate normal distribution in Section 4.3. Then, if a random number
generator is used to draw $(k_2+1)$-vectors $v^*$ from $N(0, I)$, we see that $\hat A v^*$
is a drawing from $N(0, \hat\Sigma)$, as desired.
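
A minimal sketch of these parametric drawings, assuming the centered residual matrix is available as a NumPy array. The function name `parametric_draws` is hypothetical, and the lower triangular factor is obtained with NumPy's Cholesky routine, which is one way (not the only way) to find a matrix $\hat A$ with $\hat A\hat A^\top = \hat\Sigma$.

```python
import numpy as np

def parametric_draws(V_hat, rng):
    """Return n rows drawn from N(0, Sigma_hat), plus Sigma_hat and A."""
    n, m = V_hat.shape
    Sigma_hat = V_hat.T @ V_hat / n          # (1/n) V'V
    A = np.linalg.cholesky(Sigma_hat)        # lower triangular, A A' = Sigma_hat
    Z = rng.standard_normal((n, m))          # each row is a draw v* from N(0, I)
    V_star = Z @ A.T                         # row t is (A v_t*)'
    return V_star, Sigma_hat, A

rng = np.random.default_rng(0)
V_hat = rng.standard_normal((200, 3))        # stand-in residual matrix
V_star, Sigma_hat, A = parametric_draws(V_hat, rng)
```

Stacking the draws as rows and postmultiplying by $\hat A^\top$ is equivalent to premultiplying each column vector $v^*$ by $\hat A$.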


The rest of the implementation is the same for both the parametric and the
semiparametric bootstrap. For each bootstrap replication, the endogenous
explanatory variables are first generated by the bootstrap reduced form

$$Y_2^* = W\hat\Pi_2 + V_2^*, \qquad (8.82)$$

where $\hat\Pi_2$ and $V_2^*$ are just the matrices $\hat\Pi$ and $V^*$ without their first columns.
Then the main dependent variable is generated so as to satisfy the null
hypothesis:

$$y^* = X_1^*\tilde\beta_1 + u^*.$$

Here the star on $X_1^*$ indicates that some of the regressors in $X_1$ may be
endogenous, and so will have been simulated using (8.82). The bootstrap
error terms $u^*$ are just the first column of $V^*$. For each bootstrap sample, the
two IVGNRs are estimated, and a bootstrap F statistic is computed. Then,
as usual, the bootstrap P value is the proportion of bootstrap F statistics
greater than the F statistic computed from the original data.
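
The final step, turning the bootstrap statistics into a P value, is simple enough to write down directly. The sketch below assumes the bootstrap F statistics have already been collected in an array; the function name `bootstrap_p_value` is hypothetical.

```python
import numpy as np

def bootstrap_p_value(stat, boot_stats):
    """Proportion of bootstrap statistics strictly greater than `stat`."""
    boot_stats = np.asarray(boot_stats, dtype=float)
    return float(np.mean(boot_stats > stat))

# Four made-up bootstrap F statistics, two of which exceed the original 2.0.
p = bootstrap_p_value(2.0, [1.0, 3.0, 2.5, 0.5])
```

In practice the number of bootstrap replications would be far larger than four; the tiny array here is only to make the arithmetic easy to follow.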


Bootstrapping tests of overidentifying restrictions follows the same lines. Since 
the null hypothesis for such a test is just the model being estimated, the only 
extra work needed is the estimation of the reduced form model (8.80) for the 
endogenous explanatory variables. Bootstrap error terms are generated by a 
parametric or semiparametric bootstrap, and the residuals from the IV esti- 
mation using the bootstrap data are regressed on the full set of instruments. 
The simplest test statistic is just the $nR^2$ from this regression.


It is particularly easy to bootstrap DWH tests, because for them the null 
hypothesis is that none of the explanatory variables is endogenous. It is 
therefore quite unnecessary to model them by (8.80), and bootstrap data are 
generated as for any other model to be estimated by least squares. Note that, 
if we are prepared to make the strong assumptions of the classical normal 
linear model under the null, the bootstrap is quite unnecessary, because, as 
we saw in the previous section, the test statistic has a known finite-sample 
distribution. 


If some of the non-endogenous explanatory variables are lagged dependent
variables, or lags of the endogenous explanatory variables, bootstrap samples
must be generated recursively, as for the case of the ordinary regression model
with a lagged dependent variable, for which the recursive bootstrap DGP
was (4.66). Especially if lags of endogenous explanatory variables are involved,
this may become quite complicated.

It is worth issuing a warning that, for a number of reasons well beyond the
scope of this chapter, the bootstrap method outlined above cannot be expected
to work as well as the bootstrap methods for regression models discussed
in earlier chapters. Some reasons for this are discussed in Dufour (1997).
Bootstrapping of simultaneous equations models is still an active topic of
research, and new methods are constantly being developed.


8.9 IV Estimation of Nonlinear Models 


In this section, we extend the results of this chapter beyond the linear model 
(8.10) dealt with up to this point by very briefly discussing instrumental 
variables estimation of the nonlinear regression model 


$$y = x(\beta) + u, \qquad E(uu^\top) = \sigma^2 I, \qquad (8.83)$$

where the notation is that of Chapter 6. Some of the results that we will
obtain are formally the same as ones previously obtained in Section 6.2 in the
context of MM estimation. However, in contrast to what we assumed there,
we now assume that at least some of the variables on which the regression
functions $x_t(\beta)$ depend are not contained in the information sets $\Omega_t$
with respect to which the error terms are innovations. This leads the error
terms to be correlated with the regression functions and at least some of their
derivatives. In consequence, for essentially the same reasons as in the linear
case, the NLS parameter estimates will be inconsistent.


If the vector $\beta$ in (8.83) is $k$-dimensional, consistent MM estimates based
on an $n \times k$ matrix of exogenous or predetermined instruments, $W$, can be
obtained by solving the moment conditions (6.10):

$$W^\top\bigl(y - x(\beta)\bigr) = 0.$$

By using arguments similar to those employed in Sections 6.3 and 8.3, it can
be shown that the optimal instruments, by the criterion of the asymptotic
variance, are given by $\bar X_0 \equiv \bar X(\beta_0)$. Here, $\beta_0$ denotes the true parameter
vector, and the $n \times k$ matrix $X(\beta)$ is defined, as in Section 6.2, to be the
matrix of partial derivatives of the nonlinear regression functions with respect
to the parameters. As in (8.18), the bar signifies expectations conditional on
the relevant information sets: The $t^{\rm th}$ row of $\bar X_0$ is $E(X_t(\beta_0) \mid \Omega_t)$, while the
$t^{\rm th}$ row of $X_0$ is just $X_t(\beta_0)$.


If we restrict our attention to instruments that can be expressed as linear
combinations of the $l$ columns of a given instrument matrix $W$, with $l \ge k$,
the analog of the result that the optimal instruments in this class are given
by (8.23) is that they are given by $P_W X_0$. Since $\beta_0$ is not known, it is
convenient to use the same trick as that used for nonlinear least squares by
solving the set of moment conditions

$$X^\top(\beta)\,P_W\bigl(y - x(\beta)\bigr) = 0. \qquad (8.84)$$

These moment conditions are the analog in the IV context of conditions (6.27)
for the least-squares case. If it exists, the solution to equations (8.84) is the
nonlinear instrumental variables estimator, or NLIV estimator.


The NLIV estimates also minimize the nonlinear IV criterion function

$$Q(\beta, y) = \bigl(y - x(\beta)\bigr)^\top P_W \bigl(y - x(\beta)\bigr), \qquad (8.85)$$

which generalizes the ordinary IV criterion function (8.30) in the obvious way.
As usual, the first-order conditions for minimizing (8.85) are equivalent to the
moment conditions (8.84), but it is usually easier to minimize $Q(\beta, y)$ than it
is to solve the moment conditions directly. In contrast to the situation with
linear models, minimizing (8.85) is, in general, not equivalent to replacing the
current endogenous regressors $Y$ by $P_W Y$ and then minimizing the sum of
squared residuals. The two procedures will be equivalent only if $x(\beta)$ is a
linear function of $Y$. Thus, even though it is quite common to refer to NLIV
estimation as nonlinear two-stage least squares, it is incorrect and misleading
to do so, because NLIV estimates are never actually computed in two stages.


The strong asymptotic identification condition for the NLIV estimator is that
the matrix $S_{X_0^\top W}(S_{W^\top W})^{-1}S_{W^\top X_0}$ is positive definite, where $S_{X_0^\top W}$ and
$S_{W^\top X_0}$ are defined, analogously to $S_{X^\top W}$ and $S_{W^\top X}$, as $\operatorname{plim} n^{-1}X_0^\top W$
and $\operatorname{plim} n^{-1}W^\top X_0$, respectively. As with nonlinear models estimated by
least squares, the strong asymptotic identification condition is sufficient, but
not necessary, for ordinary asymptotic identification; see Section 6.2.


If the strong asymptotic identification condition is satisfied, the NLIV
estimator can be shown to be consistent by the usual sort of reasoning. It is also
asymptotically normal, and it satisfies the equation

$$n^{1/2}(\hat\beta_{\rm NLIV} - \beta_0) \stackrel{a}{=} \bigl(n^{-1}X_0^\top P_W X_0\bigr)^{-1} n^{-1/2} X_0^\top P_W u, \qquad (8.86)$$

from which it follows that the asymptotic covariance matrix is

$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\beta_{\rm NLIV} - \beta_0)\Bigr) = \operatorname*{plim}_{n\to\infty} \sigma_0^2\bigl(n^{-1}X_0^\top P_W X_0\bigr)^{-1}, \qquad (8.87)$$

where $\sigma_0^2$ is the true error variance. We previously obtained this result in
(6.26) under stronger assumptions about the error terms. Based on (8.87), a
suitable estimator of the actual covariance matrix is

$$\widehat{\operatorname{Var}}(\hat\beta_{\rm NLIV}) = \hat\sigma^2\bigl(\hat X^\top P_W \hat X\bigr)^{-1}, \qquad (8.88)$$

where $\hat X \equiv X(\hat\beta_{\rm NLIV})$, and $\hat\sigma^2$ is $1/n$ times the SSR from IV estimation
of regression (8.83). Readers may find it instructive to compare (8.88) with
expression (8.34), the covariance matrix of the generalized IV estimator for a
linear regression model.


Copyright © 1999, Russell Davidson and James G. MacKinnon 




The nonlinear version of the IVGNR is a simple extension of (8.49). It can
be written as

$$y - x(\beta) = P_W X(\beta)\,b + \text{residuals}. \qquad (8.89)$$


In Exercise 8.24, readers are invited to show that this artificial regression has 
the properties necessary for its use in hypothesis testing, and to develop a 
heteroskedasticity-robust version of it. Hypothesis testing can also be carried 
out on the basis of the nonlinear IV criterion function (8.85), in precisely the 
same way as for linear models. 
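
One way to see how the nonlinear IVGNR relates to the moment conditions (8.84) is to iterate it: regress $y - x(\beta)$ on $P_W X(\beta)$, add the resulting coefficients $b$ to $\beta$, and repeat until $b$ is negligible. At a fixed point, $X^\top(\beta)P_W(y - x(\beta)) = 0$, which is (8.84). The Python sketch below is a bare Gauss-Newton loop under stated assumptions: the function name `nliv_gauss_newton` and its arguments are hypothetical, the caller supplies the regression function and its Jacobian, and there is no step-length control, so convergence is not guaranteed for badly behaved models.

```python
import numpy as np

def nliv_gauss_newton(y, x_fun, X_fun, W, beta_start, tol=1e-10, max_iter=100):
    """Iterate the nonlinear IVGNR until the update b is negligible.

    x_fun(beta) returns the n-vector x(beta);
    X_fun(beta) returns the n x k matrix of derivatives X(beta).
    """
    beta = np.asarray(beta_start, dtype=float)

    def project_on_W(M):
        # P_W M computed by least squares, without forming the n x n matrix P_W.
        return W @ np.linalg.lstsq(W, M, rcond=None)[0]

    for _ in range(max_iter):
        resid = y - x_fun(beta)
        PX = project_on_W(X_fun(beta))
        b = np.linalg.lstsq(PX, resid, rcond=None)[0]   # IVGNR coefficients
        beta = beta + b
        if np.max(np.abs(b)) < tol:
            break
    return beta
```

As a sanity check, when $x(\beta) = X\beta$ the first step already lands on the generalized IV estimator, whatever the starting value, and the second step returns $b = 0$.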


Tests of overidentifying restrictions and DWH tests for nonlinear models are 
likewise simple and obvious extensions of those for linear models. The
minimized value of (8.85), when divided by any consistent estimate of $\sigma^2$, is
asymptotically distributed as $\chi^2(l - k)$ and may be used to test the overidentifying
restrictions. Although bootstrapping of nonlinear models estimated by NLIV 
can be carried out just as in Section 8.8, with the endogenous explanatory 
variables generated by the set of linear equations (8.80), the requirement that 
these equations should be linear may often be uncomfortably strong. In such 
cases, it would be unwise in the present state of the art to make any specific 
recommendations. 


8.10 Final Remarks 


Although it is formally very similar to other MM estimators that we have 
studied, the IV estimator does involve several important new concepts. These 
include the idea of an instrumental variable, the notion of forming a set of 
instruments optimally as weighted combinations of a larger number of instru- 
ments when that number exceeds the number of parameters, and the concept 
of overidentifying restrictions. 


The optimality of the generalized IV estimator depends critically on the fairly 
strong assumption that the error terms are homoskedastic and serially uncor- 
related. When this assumption is relaxed, it may be possible to obtain MM 
estimators that are more efficient than the GIV estimator. These “generalized 
method of moments” estimators will be the topic of the next chapter. 


8.11 Exercises 


8.1 Consider a very simple consumption function, of the form

$$c_i = \beta_1 + \beta_2 y_i^* + u_i^*, \qquad u_i^* \sim \mathrm{IID}(0, \sigma^2),$$

where $c_i$ is the logarithm of consumption by household $i$, and $y_i^*$ is the
permanent income of household $i$, which is not observed. Instead, we observe
current income $y_i$, which is equal to $y_i^* + v_i$, where $v_i \sim \mathrm{IID}(0, \omega^2)$ is assumed
to be uncorrelated with $y_i^*$ and $u_i^*$. Therefore, we run the regression

$$c_i = \beta_1 + \beta_2 y_i + u_i.$$

Under the plausible assumption that the true value $\beta_{20}$ is positive, show that
$y_i$ is negatively correlated with $u_i$. Using this result, evaluate the plim of the
OLS estimator $\hat\beta_2$, and show that this plim is less than $\beta_{20}$.

8.2 Consider the simple IV estimator (8.12), computed first with an $n \times k$ matrix
$W$ of instrumental variables, and then with another $n \times k$ matrix $WJ$, where
$J$ is a $k \times k$ nonsingular matrix. Show that the two estimators coincide. Why
does this fact show that (8.12) depends on $W$ only through the orthogonal
projection matrix $P_W$?

8.3 Show that, if the matrix of instrumental variables $W$ is $n \times k$, with the same
dimensions as the matrix $X$ of explanatory variables, then the generalized
IV estimator (8.29) is identical to the simple IV estimator (8.12).

8.4 Show that minimizing the criterion function (8.30) with respect to $\beta$ yields
the generalized IV estimator (8.29).

8.5 Under the usual assumptions of this chapter, including (8.16), show that the
probability limit of

$$\frac{1}{n}Q(\beta_0, y) = \frac{1}{n}(y - X\beta_0)^\top P_W (y - X\beta_0)$$

is zero if $y = X\beta_0 + u$. Under the same assumptions, along with the
asymptotic identification condition that $S_{X^\top W}(S_{W^\top W})^{-1}S_{W^\top X}$ has full rank,
show further that $\operatorname{plim} n^{-1}Q(\beta, y)$ is strictly positive for $\beta \ne \beta_0$.

8.6 Under assumption (8.16) and the asymptotic identification condition that
$S_{X^\top W}(S_{W^\top W})^{-1}S_{W^\top X}$ has full rank, show that the GIV estimator $\hat\beta_{\rm IV}$ is
consistent by explicitly computing the probability limit of the estimator for
a DGP such that $y = X\beta_0 + u$.

8.7 Suppose that you can apply a central limit theorem to the vector $n^{-1/2}W^\top u$,
with the result that it is asymptotically multivariate normal, with mean 0
and covariance matrix (8.33). Use (8.32) to demonstrate explicitly that, if
$y = X\beta_0 + u$, then $n^{1/2}(\hat\beta_{\rm IV} - \beta_0)$ is asymptotically normal with mean 0
and covariance matrix (8.17).

8.8 Suppose that $W_1$ and $W_2$ are, respectively, $n \times l_1$ and $n \times l_2$ matrices of
instruments, and that $W_2$ consists of $W_1$ plus $l_2 - l_1$ additional columns.
Prove that the generalized IV estimator using $W_2$ is asymptotically more
efficient than the generalized IV estimator using $W_1$. To do this, you need to
show that the matrix $(X^\top P_{W_1} X)^{-1} - (X^\top P_{W_2} X)^{-1}$ is positive semidefinite.
Hint: see Exercise 3.8.

8.9 Show that the simple IV estimator defined in (8.41) is unbiased when the data
are generated by (8.40) with $\sigma_v = 0$. Interpret this result.

8.10 Use the DGP (8.40) to generate at least 1000 sets of simulated data for $x$
and $y$ with sample size $n = 10$, using normally distributed error terms and
parameter values $\sigma_u = \sigma_v = 1$, $\pi_0 = 1$, $\beta_0 = 0$, and $\rho = 0.5$. For the
exogenous instrument $w$, use independent drawings from the standard normal
distribution, and then rescale $w$ so that $w^\top w$ is equal to $n$, rather than 1 as
in Section 8.4.

For each simulated data set, compute the IV estimator (8.41). Then draw the
empirical distribution of the realizations of the estimator on the same plot
as the CDF of the normal distribution with mean zero and variance $\sigma_u^2/(n\pi_0^2)$.
Explain why this is an appropriate way to compare the finite-sample and
asymptotic distributions of the estimator.

In addition, for each simulated data set, compute the OLS estimator, and plot
the EDF of the realizations of this estimator on the same axes as the EDF of
the realizations of the IV estimator.


8.11 Redo Exercise 8.10 for a sample size of $n = 100$. If you have enough computer
time available, redo it yet again for $n = 1000$, in order to see how quickly or
slowly the finite-sample distribution tends to the asymptotic distribution.

8.12 Redo the simulations of Exercise 8.10, for $n = 10$, generating the exogenous
instrument $w$ as follows. For the first experiment, use independent drawings
from the uniform distribution on $[-1, 1]$. For the second, use drawings from
the AR(1) process $w_t = \alpha w_{t-1} + \varepsilon_t$, where $w_0 = 0$, $\alpha = 0.8$, and the $\varepsilon_t$ are
independent drawings from $N(0, 1)$. In all cases, rescale $w$ so that $w^\top w = n$.
To what extent does the empirical distribution of $\hat\beta_{\rm IV}$ appear to depend on
the properties of $w$? What theoretical explanation can you think of for your
results?

8.13 Include one more instrument in the simulations of Exercise 8.10. Continue
to use the same DGP for $y$ and $x$, but replace the simple IV estimator by
the generalized one, based on two instruments $w$ and $z$, where $z$ is generated
independently of everything else in the simulation. See if you can verify the
theoretical prediction that the overidentified estimator computed with two
instruments is more biased, but has thinner tails, than the just identified
estimator.

Repeat the simulations twice more, first with two additional instruments and
then with four. What happens to the distribution of the estimator as the
number of instruments increases?

8.14 Verify that $\hat\beta_{\rm IV}$ is the OLS estimator for model (8.10) when the regressor
matrix is $X = [Z \;\; Y] = W\Pi$, with the matrix $V$ in (8.44) equal to $O$. Is
this estimator consistent? Explain.

8.15 Verify, by use of the assumption that the instruments in the matrix $W$ are
exogenous or predetermined, and by use of a suitable law of large numbers,
that all the terms in (8.45) that involve $V$ do not contribute to the probability
limit of (8.45) as the sample size tends to infinity.

8.16 Show that the vector of residuals obtained by running the IVGNR (8.49) with
$\beta = \acute\beta$ is equal to $y - X\hat\beta_{\rm IV} + M_W X(\hat\beta_{\rm IV} - \acute\beta)$. Use this result to show that
$\acute\sigma^2$, the estimate of the error variance given by the IVGNR, is consistent for
the error variance of the underlying model (8.10) if $\acute\beta$ is root-$n$ consistent.

8.17 Prove that expression (8.56) is equal to expression (8.57). Hint: Use the facts
that $P_{P_W X_1} X_1 = P_W X_1$ and $P_{P_W X} P_{P_W X_1} = P_{P_W X_1}$.

8.18 Show that $k_2$ times the artificial F statistic from the pair of IVGNRs (8.53)
and (8.54) is asymptotically equal to the Wald statistic (8.48), using reasoning
similar to that employed in Section 6.7. Why are these two statistics not
numerically identical? Show that the asymptotic equality does not hold if
different matrices of instruments are used in the two IVGNRs.

8.19 Sketch a proof of the result that expression (8.58),

$$\frac{1}{\sigma_0^2}\,u^\top\bigl(P_{P_W X} - P_{P_W X_1}\bigr)u,$$

is asymptotically distributed as $\chi^2(k_2)$ when the vector $u$ is $\mathrm{IID}(0, \sigma_0^2 I)$ and
is asymptotically uncorrelated with the instruments $W$. Here $k_2 = k - k_1$,
where $X$ has $k$ columns and $X_1$ has $k_1$ columns.

8.20 The IV variant of the HRGNR (6.90), evaluated at $\beta = \acute\beta$, can be written as

$$\iota = P_{\acute U P_W X}\,\acute U^{-1} P_W X b + \text{residuals}, \qquad (8.90)$$

where $\iota$ is an $n$-vector of which every component equals 1, and $\acute U$ is an $n \times n$
diagonal matrix with $t^{\rm th}$ diagonal element equal to the $t^{\rm th}$ element of the
vector $y - X\acute\beta$.

Verify that this artificial regression possesses all the requisite properties for
hypothesis testing, namely, that:

• The regressand in (8.90) is orthogonal to the regressors when $\acute\beta = \hat\beta_{\rm IV}$;

• The estimated OLS covariance matrix from (8.90) evaluated at $\acute\beta = \hat\beta_{\rm IV}$ is
equal to $n/(n - k)$ times the HCCME $\widehat{\operatorname{Var}}_{\rm h}(\hat\beta_{\rm IV})$ given by (8.65);

• The HRGNR (8.90) allows one-step estimation: The OLS parameter
estimates $\acute b$ from (8.90) are such that $\hat\beta_{\rm IV} = \acute\beta + \acute b$.

8.21 Show that $nR^2$ from the modified IVGNR (8.72) is equal to the Sargan test
statistic, that is, the minimized IV criterion function for model (8.68) divided
by the IV estimate of the error variance for that model.

8.22 Consider the following OLS regression, where the variables have the same
interpretation as in Section 8.7 on DWH tests:

$$y = X\beta + M_W Y\zeta + u. \qquad (8.91)$$

Show that an F test of the restrictions $\zeta = 0$ in (8.91) is numerically identical
to the F test for $\delta = 0$ in (8.77). Show further that the OLS estimator of $\beta$
from (8.91) is identical to the estimator $\hat\beta_{\rm IV}$ obtained by estimating (8.74)
by instrumental variables.

8.23 Show that the difference between the generalized IV estimator $\hat\beta_{\rm IV}$ and the
OLS estimator $\hat\beta_{\rm OLS}$, for which an explicit expression is given in equation
(8.76), has zero covariance with $\hat\beta_{\rm OLS}$ itself. For simplicity, you may treat
the matrix $X$ as fixed.


8.24 Using the same methods as those in Sections 6.5 and 6.6, show that the
nonlinear version (8.89) of the IVGNR satisfies the three conditions, analogous
to those set out in Exercise 8.20, which are necessary for the use of the
IVGNR in hypothesis testing. What is the nonlinear version of the IV variant
of the HRGNR? Show that it, too, satisfies the three conditions under the
assumption of possibly heteroskedastic error terms.

8.25 The data in the file money.data are described in Exercise 7.14. Using these
data, estimate the model

$$m_t = \beta_1 + \beta_2 r_t + \beta_3 y_t + \beta_4 m_{t-1} + \beta_5 m_{t-2} + u_t \qquad (8.92)$$

by OLS for the period 1968:1 to 1998:4. Then perform a DWH test for the
hypothesis that the interest rate, $r_t$, can be treated as exogenous, using $r_{t-1}$
and $r_{t-2}$ as additional instruments.

8.26 Estimate equation (8.92) by generalized instrumental variables, treating $r_t$
as endogenous and using $r_{t-1}$ and $r_{t-2}$ as additional instruments. Are the
estimates much different from the OLS ones? Verify that the IV estimates
may also be obtained by OLS estimation of equation (8.91). Are the reported
standard errors the same? Explain why or why not.

8.27 Perform a Sargan test of the overidentifying restrictions for the IV estimation
you performed in Exercise 8.26. How do you interpret the results of this test?

8.28 The file demand-supply.data contains 120 artificial observations on a
demand-supply model similar to equations (8.06)-(8.07). The demand equation is

$$q_t = \beta_1 + \beta_2 X_{t2} + \beta_3 X_{t3} + \gamma p_t + u_t, \qquad (8.93)$$

where $q_t$ is the log of quantity, $p_t$ is the log of price, $X_{t2}$ is the log of income,
and $X_{t3}$ is a dummy variable that accounts for regular demand shifts.

Estimate equation (8.93) by OLS and 2SLS, using the variables $X_{t4}$ and $X_{t5}$
as additional instruments. Does OLS estimation appear to be valid here?
Does 2SLS estimation appear to be valid here? Perform whatever tests are
appropriate to answer these questions.

Reverse the roles of $q_t$ and $p_t$ in equation (8.93) and estimate the new equation
by OLS and 2SLS. How are the two estimates of the coefficient of $q_t$ in the
new equation related to the corresponding estimates of $\gamma$ from the original
equation? What do these results suggest about the validity of the OLS and
2SLS estimates?


Chapter 9


The Generalized 
Method of Moments 


9.1 Introduction 


The models we have considered in earlier chapters have all been regression 
models of one sort or another. In this chapter and the next, we introduce 
more general types of models, along with a general method for performing 
estimation and inference on them. This technique is called the generalized 
method of moments, or GMM, and it includes as special cases all the methods 
we have so far developed for regression models. 


As we explained in Section 3.1, a model is represented by a set of DGPs.
Each DGP in the model is characterized by a parameter vector, which we
will normally denote by $\beta$ in the case of regression functions and by $\theta$ in the
general case. The starting point for GMM estimation is to specify functions,
which, for any DGP in the model, depend both on the data generated by that
DGP and on the model parameters. When these functions are evaluated at
the parameters that correspond to the DGP that generated the data, their
expectation must be zero.


As a simple example, consider the linear regression model $y_t = X_t\beta + u_t$.
An important part of the model specification is that the error terms have
mean zero. These error terms are unobservable, because the parameters $\beta$
of the regression function are unknown. But we can define the residuals
$u_t(\beta) \equiv y_t - X_t\beta$ as functions of the observed data and the unknown model
parameters, and these functions provide what we need for GMM estimation.
If the residuals are evaluated at the parameter vector $\beta_0$ associated with the
true DGP, they have mean zero under that DGP, but if they are evaluated at
some $\beta \ne \beta_0$, they do not have mean zero. In Chapter 1, we used this fact
to develop a method of moments (MM) estimator for the parameter vector $\beta$
of the regression function. As we will see in the next section, the various
GMM estimators of $\beta$ include as a special case the MM (or OLS) estimator
developed in Chapter 1.


In Chapter 6, when we dealt with nonlinear regression models, and again in
Chapter 8, we used instrumental variables along with residuals in order to
develop MM estimators. The use of instrumental variables is also an essential
aspect of GMM, and in this chapter we will once again make use of the various
kinds of optimal instruments that were useful in Chapters 6 and 8 in order
to develop a wide variety of estimators that are asymptotically efficient for a
wide variety of models.


We begin by considering, in the next section, a linear regression model with 
endogenous explanatory variables and an error covariance matrix that is not 
proportional to the identity matrix. Such a model requires us to combine 
the insights of both Chapters 7 and 8 in order to obtain asymptotically effi- 
cient estimates. In the process of doing so, we will see how GMM estimation 
works more generally, and we will be led to develop ways to estimate models 
with both heteroskedasticity and serial correlation of unknown form. In Sec- 
tion 9.3, we study in some detail the heteroskedasticity and autocorrelation 
consistent, or HAC, covariance matrix estimators that we briefly mentioned 
in Section 5.5. Then, in Section 9.4, we introduce a set of tests, based on 
GMM criterion functions, that are widely used for inference in conjunction 
with GMM estimation. In Section 9.5, we move beyond regression models 
to give a more formal and advanced presentation of GMM, and we postpone 
to this section most of the proofs of consistency, asymptotic normality, and 
asymptotic efficiency for GMM estimators. In Section 9.6, which depends 
heavily on the more advanced treatment of the preceding section, we consider 
the Method of Simulated Moments, or MSM. This method allows us to obtain 
GMM estimates by simulation even when we cannot analytically evaluate the 
functions that play the same role as residuals for a regression model. 


9.2 GMM Estimators for Linear Regression Models


Consider the linear regression model

$$y = X\beta + u, \qquad E(uu^\top) = \Omega, \qquad (9.01)$$

where there are $n$ observations, and $\Omega$ is an $n \times n$ covariance matrix. As in
the previous chapter, some of the explanatory variables that form the $n \times k$
matrix $X$ may not be predetermined with respect to the error terms $u$.
However, there is assumed to exist an $n \times l$ matrix of predetermined instrumental
variables, $W$, with $n > l$ and $l \ge k$, satisfying the condition $E(u_t \mid W_t) = 0$ for
each row $W_t$ of $W$, $t = 1, \ldots, n$. Any column of $X$ that is predetermined will
also be a column of $W$. In addition, we assume that, for all $t, s = 1, \ldots, n$,
$E(u_t u_s \mid W_t, W_s) = \omega_{ts}$, where $\omega_{ts}$ is the $ts^{\rm th}$ element of $\Omega$. We will need this
assumption later, because it allows us to see that

$$\begin{aligned}
\operatorname{Var}(n^{-1/2}W^\top u) &= \frac{1}{n}E(W^\top uu^\top W) = \frac{1}{n}\sum_{t=1}^n\sum_{s=1}^n E(u_t u_s W_t^\top W_s) \\
&= \frac{1}{n}\sum_{t=1}^n\sum_{s=1}^n E\bigl(E(u_t u_s W_t^\top W_s \mid W_t, W_s)\bigr) \\
&= \frac{1}{n}\sum_{t=1}^n\sum_{s=1}^n E(\omega_{ts} W_t^\top W_s) = \frac{1}{n}E(W^\top \Omega W). \qquad (9.02)
\end{aligned}$$

The assumption that $E(u_t \mid W_t) = 0$ implies that, for all $t = 1, \ldots, n$,

$$E\bigl(W_t^\top (y_t - X_t\beta)\bigr) = 0. \qquad (9.03)$$


These equations form a set of what we may call theoretical moment conditions. 
They were used in Chapter 8 as the starting point for MM estimation of the 
regression model (9.01). Each theoretical moment condition corresponds to a 
sample moment, or empirical moment, of the form 


$$\frac{1}{n}\sum_{t=1}^n W_{ti}(y_t - X_t\beta) = \frac{1}{n}w_i^\top(y - X\beta), \qquad (9.04)$$

where $w_i$, $i = 1, \ldots, l$, is the $i^{\rm th}$ column of $W$. When $l = k$, we can set these
sample moments equal to zero and solve the resulting $k$ equations to obtain the
simple IV estimator (8.12). When $l > k$, we must do as we did in Chapter 8
and select $k$ independent linear combinations of the sample moments (9.04)
in order to obtain an estimator.


Now let $J$ be an $l \times k$ matrix with full column rank $k$, and consider the
MM estimator obtained by using the $k$ columns of $WJ$ as instruments. This
estimator solves the $k$ equations

$$J^\top W^\top(y - X\beta) = 0, \qquad (9.05)$$

which are referred to as sample moment conditions, or just moment conditions
when there is no ambiguity. They are also sometimes called orthogonality
conditions, since they require that the vector of residuals should be orthogonal
to the columns of $WJ$. Let us assume that the data are generated by a DGP
which belongs to the model (9.01), with coefficient vector $\beta_0$ and covariance
matrix $\Omega_0$. Under this assumption, we have the following explicit expression,
suitable for asymptotic analysis, for the estimator $\hat\beta$ that solves (9.05):

$$n^{1/2}(\hat\beta - \beta_0) = \bigl(n^{-1}J^\top W^\top X\bigr)^{-1} n^{-1/2}J^\top W^\top u. \qquad (9.06)$$

From this, recalling (9.02), we find that the asymptotic covariance matrix
of $\hat\beta$, that is, the covariance matrix of the plim of $n^{1/2}(\hat\beta - \beta_0)$, is

$$\Bigl(\operatorname{plim}\frac{1}{n}J^\top W^\top X\Bigr)^{-1}\Bigl(\operatorname{plim}\frac{1}{n}J^\top W^\top \Omega_0 WJ\Bigr)\Bigl(\operatorname{plim}\frac{1}{n}X^\top WJ\Bigr)^{-1}. \qquad (9.07)$$


This matrix has the familiar sandwich form that we expect to see when an
estimator is not asymptotically efficient.

The next step, as in Section 8.3, is to choose $J$ so as to minimize the covariance
matrix (9.07). We may reasonably expect that, with such a choice of $J$, the
covariance matrix will no longer have the form of a sandwich. The simplest
choice of $J$ that eliminates the sandwich in (9.07) is

$$J = (W^\top \Omega_0 W)^{-1}W^\top X; \qquad (9.08)$$

notice that, in the special case in which $\Omega_0$ is proportional to $I$, this expression
will reduce to the result (8.24) that we found in Section 8.3 as the solution
for that special case. We can see, therefore, that (9.08) is the appropriate
generalization of (8.24) when $\Omega$ is not proportional to an identity matrix.
With $J$ defined by (9.08), the covariance matrix (9.07) becomes

$$\operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}X^\top W(W^\top \Omega_0 W)^{-1}W^\top X\Bigr)^{-1}, \qquad (9.09)$$

and the efficient GMM estimator is

$$\hat\beta_{\rm GMM} = \bigl(X^\top W(W^\top \Omega_0 W)^{-1}W^\top X\bigr)^{-1}X^\top W(W^\top \Omega_0 W)^{-1}W^\top y. \qquad (9.10)$$

When $\Omega_0 = \sigma^2 I$, this estimator reduces to the generalized IV estimator (8.29).
In Exercise 9.1, readers are invited to show that the difference between the
covariance matrices (9.07) and (9.09) is a positive semidefinite matrix, thereby
confirming (9.08) as the optimal choice for $J$.
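
The estimator (9.10) is easy to compute once $\Omega_0$ is given, and the claim that it reduces to the generalized IV estimator when $\Omega_0 = \sigma^2 I$ can be checked numerically. The Python sketch below is illustrative only: the function names `gmm_linear` and `giv`, the simulated data, and the choice $\sigma^2 = 3$ are all assumptions made for the example, and $\Omega_0$ is treated as known, which it usually is not.

```python
import numpy as np

def gmm_linear(y, X, W, Omega):
    """Efficient GMM estimator (9.10) for a known covariance matrix Omega."""
    WX = W.T @ X                                  # l x k
    Wy = W.T @ y                                  # l-vector
    S = W.T @ Omega @ W                           # l x l matrix W' Omega W
    S_inv_WX = np.linalg.solve(S, WX)
    S_inv_Wy = np.linalg.solve(S, Wy)
    return np.linalg.solve(WX.T @ S_inv_WX, WX.T @ S_inv_Wy)

def giv(y, X, W):
    """Generalized IV estimator (X' P_W X)^{-1} X' P_W y."""
    PX = W @ np.linalg.lstsq(W, X, rcond=None)[0]  # P_W X
    return np.linalg.solve(PX.T @ X, PX.T @ y)

rng = np.random.default_rng(0)
n, l, k = 60, 4, 2
W = rng.standard_normal((n, l))
X = W[:, :k] + 0.5 * rng.standard_normal((n, k))
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

same = np.allclose(gmm_linear(y, X, W, 3.0 * np.eye(n)), giv(y, X, W))
```

Linear systems are solved directly rather than by forming explicit inverses, which is a numerical convenience and not part of the definition of the estimator.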


The GMM criterion function 


With both GLS and IV estimation, we showed that the efficient estimators 
could also be derived by minimizing an appropriate criterion function; this 
function was (7.06) for GLS and (8.30) for IV. Similarly, the efficient GMM 
estimator (9.10) minimizes the GMM criterion function 


$$Q(\beta, y) = (y - X\beta)^\top W(W^\top \Omega_0 W)^{-1}W^\top(y - X\beta), \qquad (9.11)$$

as can be seen at once by noting that the first-order conditions for
minimizing (9.11) are

$$X^\top W(W^\top \Omega_0 W)^{-1}W^\top(y - X\beta) = 0.$$

If $\Omega_0 = \sigma_0^2 I$, (9.11) reduces to the IV criterion function (8.30), divided by $\sigma_0^2$.
In Section 8.6, we saw that the minimized value of the IV criterion
function, divided by an estimate of $\sigma_0^2$, serves as the statistic for the Sargan test
for overidentification. We will see in Section 9.4 that the GMM criterion
function (9.11), with the usually unknown matrix $\Omega_0$ replaced by a suitable
estimate, can also be used as a test statistic for overidentification.


The criterion function (9.11) is a quadratic form in the vector $W^\top(y - X\beta)$ of
sample moments and the inverse of the matrix $W^\top \Omega_0 W$. Equivalently, it is a
quadratic form in $n^{-1/2}W^\top(y - X\beta)$ and the inverse of $n^{-1}W^\top \Omega_0 W$, since
the powers of $n$ cancel. Under the sort of regularity conditions we have used
in earlier chapters, $n^{-1/2}W^\top(y - X\beta_0)$ satisfies a central limit theorem, and
so tends, as $n \to \infty$, to a normal random variable, with mean vector $0$ and
covariance matrix the limit of $n^{-1}W^\top \Omega_0 W$. It follows that (9.11) evaluated
using the true $\beta_0$ and the true $\Omega_0$ is asymptotically distributed as $\chi^2$ with
$l$ degrees of freedom; recall Theorem 4.1, and see Exercise 9.2.


This property of the GMM criterion function is simply a consequence of its 
structure as a quadratic form in the sample moments used for estimation and 
the inverse of the asymptotic covariance matrix of these moments evaluated 
at the true parameters. As we will see in Section 9.4, this property is what 
makes the GMM criterion function useful for testing. The argument leading 
to (9.10) shows that this same property of the GMM criterion function leads 
to the asymptotic efficiency of the estimator that minimizes it. 


Provided the instruments are predetermined, so that they satisfy the condition
that $E(u_t \mid W_t) = 0$, we still obtain a consistent estimator, even when the
matrix $J$ used to select linear combinations of the instruments is different
from (9.08). Such a consistent, but in general inefficient, estimator can also
be obtained by minimizing a quadratic criterion function of the form


$$(y - X\beta)^\top W \Lambda W^\top (y - X\beta), \tag{9.12}$$


where the weighting matrix $\Lambda$ is $l \times l$, positive definite, and must be at least
asymptotically nonrandom. Without loss of generality, $\Lambda$ can be taken to be
symmetric; see Exercise 9.3. The inefficient GMM estimator is


$$\hat\beta = (X^\top W \Lambda W^\top X)^{-1} X^\top W \Lambda W^\top y, \tag{9.13}$$


from which it can be seen that the use of the weighting matrix $\Lambda$ corresponds
to the implicit choice $J = \Lambda W^\top X$. For a given choice of $J$, there are various
possible choices of $\Lambda$ that give rise to the same estimator; see Exercise 9.4.
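

The estimator (9.13) can be sketched numerically. The code below is only an illustration under assumed, simulated data (the sample size, coefficients, and the function name `gmm_weighted` are ours, not part of the text). It computes (9.13) for two weighting matrices and checks that the choice $(W^\top W)^{-1}$ reproduces two-stage least squares.

```python
import numpy as np

def gmm_weighted(y, X, W, Lam):
    """Estimator (9.13): (X'W Lam W'X)^{-1} X'W Lam W'y."""
    XW = X.T @ W                              # k x l matrix X'W
    lhs = XW @ Lam @ XW.T                     # X'W Lam W'X
    rhs = XW @ Lam @ (W.T @ y)                # X'W Lam W'y
    return np.linalg.solve(lhs, rhs)

# Simulated data (all values hypothetical): k = 2 regressors, l = 3 instruments.
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), Z])          # instrument matrix
v = rng.normal(size=n)
u = 0.5 * v + rng.normal(size=n)              # error correlated with x through v
x = Z @ np.array([1.0, 0.5]) + v
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u

b_identity = gmm_weighted(y, X, W, np.eye(3))             # weighting matrix I
b_giv = gmm_weighted(y, X, W, np.linalg.inv(W.T @ W))     # weighting matrix (W'W)^{-1}

# Two-stage least squares computed directly, for comparison with b_giv.
X_fit = W @ np.linalg.lstsq(W, X, rcond=None)[0]
b_2sls = np.linalg.lstsq(X_fit, y, rcond=None)[0]
```

Both `b_identity` and `b_giv` are consistent here, but they differ in finite samples because the model is overidentified.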


When $l = k$, the model is exactly identified, and $J$ is a nonsingular square
matrix which has no effect on the estimator. This is most easily seen by
looking at the moment conditions (9.05), which are equivalent, when $l = k$, to
those obtained by premultiplying them by $(J^\top)^{-1}$. Similarly, if the estimator
is defined by minimizing a quadratic form, it does not depend on the choice
of the weighting matrix $\Lambda$ whenever $l = k$. To see this, consider the first-order
conditions for minimizing (9.12), which, up to a scalar factor, are


$$X^\top W \Lambda W^\top (y - X\beta) = 0.$$


If $l = k$, $X^\top W$ is a square matrix, and the first-order conditions can be
premultiplied by $\Lambda^{-1}(X^\top W)^{-1}$. Therefore, the estimator is the solution to
the equations $W^\top(y - X\beta) = 0$, independently of $\Lambda$. This solution is just
the simple IV estimator defined in (8.12).
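

As a numerical check of this invariance, the following sketch (simulated data and names are assumptions made for illustration) computes the quadratic-form estimator with an arbitrary positive definite weighting matrix in an exactly identified model and compares it with the simple IV estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
z = rng.normal(size=n)
W = np.column_stack([np.ones(n), z])          # l = k = 2 instruments
v = rng.normal(size=n)
X = np.column_stack([np.ones(n), z + v])
y = X @ np.array([1.0, 2.0]) + 0.5 * v + rng.normal(size=n)

def gmm_weighted(y, X, W, Lam):
    """Minimizer of (y - Xb)' W Lam W' (y - Xb)."""
    XW = X.T @ W
    return np.linalg.solve(XW @ Lam @ XW.T, XW @ Lam @ (W.T @ y))

# Simple IV estimator: solves W'(y - X b) = 0.
b_iv = np.linalg.solve(W.T @ X, W.T @ y)

# An arbitrary symmetric positive definite weighting matrix.
M = rng.normal(size=(k, k))
Lam = M @ M.T + np.eye(k)
b_lam = gmm_weighted(y, X, W, Lam)            # agrees with b_iv up to rounding
```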


When $l > k$, the model is overidentified, and the estimator (9.13) depends
on the choice of $J$ or of the weighting matrix $\Lambda$. The efficient GMM estimator,
for a given set of instruments, is defined in terms of the true covariance matrix
$\Omega_0$, which is usually unknown. If $\Omega_0$ is known up to a scalar multiplicative
factor, so that $\Omega_0 = \sigma^2\Delta_0$, with $\sigma^2$ unknown and $\Delta_0$ known, then $\Delta_0$ can be
used in place of $\Omega_0$ in either (9.10) or (9.11). This is true because multiplying
$\Omega_0$ by a scalar leaves (9.10) invariant, and it also leaves invariant the $\beta$ that
minimizes (9.11).


GMM Estimation with Heteroskedasticity of Unknown Form 


The assumption that $\Omega_0$ is known, even up to a scalar factor, is often too
strong. What makes GMM estimation practical more generally is that, in
both (9.10) and (9.11), $\Omega_0$ appears only through the $l \times l$ matrix product
$W^\top\Omega_0 W$. As we saw first in Section 5.5, in the context of heteroskedasticity-consistent
covariance matrix estimation, $n^{-1}$ times such a matrix can be estimated
consistently if $\Omega_0$ is a diagonal matrix. What is needed is a preliminary
consistent estimate of the parameter vector $\beta$, which furnishes residuals that
are consistent estimates of the error terms.


The preliminary estimates of $\beta$ must be consistent, but they need not be
asymptotically efficient, and so we can obtain them by using any convenient
choice of $J$ or of the weighting matrix $\Lambda$. One choice that is often convenient
is $\Lambda = (W^\top W)^{-1}$, in which case the preliminary estimator is the generalized
IV estimator (8.29). We then use the preliminary estimates $\hat\beta$ to calculate the
residuals $\hat u_t \equiv y_t - X_t\hat\beta$. A typical element of the matrix $n^{-1}W^\top\Omega_0 W$ can then be
estimated by


$$\frac{1}{n}\sum_{t=1}^{n} \hat u_t^2\, W_{ti} W_{tj}. \tag{9.14}$$


This estimator is very similar to (5.36), and the estimator (9.14) can be proved
to be consistent by using arguments just like those employed in Section 5.5.


The matrix with typical element (9.14) can be written as $n^{-1}W^\top\hat\Omega W$, where
$\hat\Omega$ is an $n \times n$ diagonal matrix with typical diagonal element $\hat u_t^2$. Then the
feasible efficient GMM estimator is


$$\hat\beta_{\rm FGMM} = (X^\top W (W^\top\hat\Omega W)^{-1} W^\top X)^{-1} X^\top W (W^\top\hat\Omega W)^{-1} W^\top y, \tag{9.15}$$


which is just (9.10) with $\Omega_0$ replaced by $\hat\Omega$. Since $n^{-1}W^\top\hat\Omega W$ consistently
estimates $n^{-1}W^\top\Omega_0 W$, it follows that $\hat\beta_{\rm FGMM}$ is asymptotically equivalent
to (9.10). It should be noted that, in calling (9.15) efficient, we mean that
it is asymptotically efficient within the class of estimators that use the given
instrument set $W$.
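

The two-step procedure leading to (9.15) can be sketched as follows. This is a minimal illustration on simulated heteroskedastic data; the data-generating process and the function names `giv` and `fgmm_het` are assumptions made for the example, not part of the text.

```python
import numpy as np

def giv(y, X, W):
    """Generalized IV: the weighted estimator with weighting matrix (W'W)^{-1}."""
    PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)       # P_W X
    return np.linalg.solve(PWX.T @ X, PWX.T @ y)

def fgmm_het(y, X, W):
    """Two-step estimator (9.15) with Omega-hat = diag(squared residuals)."""
    uhat = y - X @ giv(y, X, W)                       # preliminary residuals
    S = (W * uhat[:, None] ** 2).T @ W                # W' Omega-hat W  (l x l)
    XW = X.T @ W
    lhs = XW @ np.linalg.solve(S, XW.T)
    rhs = XW @ np.linalg.solve(S, W.T @ y)
    return np.linalg.solve(lhs, rhs), S

# Simulated data (hypothetical): one endogenous regressor, heteroskedastic errors.
rng = np.random.default_rng(2)
n = 1000
Z = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), Z])
v = rng.normal(size=n)
X = np.column_stack([np.ones(n), Z @ np.array([1.0, 0.5]) + v])
u = (1.0 + np.abs(Z[:, 0])) * (0.5 * v + rng.normal(size=n))
y = X @ np.array([1.0, 2.0]) + u

b_fgmm, S = fgmm_het(y, X, W)
```

Note that the $n \times n$ matrix $\hat\Omega$ is never formed explicitly; only the $l \times l$ product $W^\top\hat\Omega W$ is needed.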


Like other procedures that start from a preliminary estimate, this one can
be iterated. The GMM residuals $y_t - X_t\hat\beta_{\rm FGMM}$ can be used to calculate a
new estimate of $\Omega$, which can then be used to obtain second-round GMM
estimates, which can then be used to calculate yet another estimate of $\Omega$,
and so on. This iterative procedure was investigated by Hansen, Heaton,
and Yaron (1996), who called it continuously updated GMM. Whether we
stop after one round or continue until the procedure converges, the estimates
will have the same asymptotic distribution if the model is correctly specified.
However, there is evidence that performing more iterations improves finite-sample
performance. In practice, the covariance matrix will be estimated by


$$\widehat{\rm Var}(\hat\beta_{\rm FGMM}) = (X^\top W (W^\top\hat\Omega W)^{-1} W^\top X)^{-1}. \tag{9.16}$$


It is not hard to see that $n$ times the estimator (9.16) tends to the asymptotic
covariance matrix (9.09) as $n \to \infty$.
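

A rough sketch of the iteration, together with the covariance matrix estimate (9.16), is given below. The number of rounds, the simulated data, and the function names are illustrative assumptions; in this sketch the reported covariance matrix uses the residuals from the round preceding the final estimate.

```python
import numpy as np

def gmm_step(y, X, W, uhat):
    """One GMM round: weight by W' diag(uhat^2) W; return estimate and (9.16)."""
    S = (W * uhat[:, None] ** 2).T @ W
    XW = X.T @ W
    V = np.linalg.inv(XW @ np.linalg.solve(S, XW.T))   # (X'W S^{-1} W'X)^{-1}
    b = V @ (XW @ np.linalg.solve(S, W.T @ y))
    return b, V

def iterated_gmm(y, X, W, rounds=5):
    # Preliminary generalized IV estimate, weighting matrix (W'W)^{-1}.
    PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
    b = np.linalg.solve(PWX.T @ X, PWX.T @ y)
    path = [b]
    for _ in range(rounds):
        b, V = gmm_step(y, X, W, y - X @ b)            # re-estimate with new residuals
        path.append(b)
    return b, V, path

# Simulated data (hypothetical), heteroskedastic errors and one endogenous regressor.
rng = np.random.default_rng(3)
n = 1000
Z = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), Z])
v = rng.normal(size=n)
X = np.column_stack([np.ones(n), Z @ np.array([1.0, 0.5]) + v])
y = X @ np.array([1.0, 2.0]) + (1.0 + np.abs(Z[:, 0])) * (0.5 * v + rng.normal(size=n))

b_it, V_it, path = iterated_gmm(y, X, W)
std_err = np.sqrt(np.diag(V_it))                       # asymptotic standard errors
```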


Fully Efficient GMM Estimation 


In choosing to use a particular matrix of instrumental variables $W$, we are
choosing a particular representation of the information sets $\Omega_t$ appropriate
for each observation in the sample. It is required that $W_t \in \Omega_t$ for all $t$,
and it follows from this that any deterministic function, linear or nonlinear,
of the elements of $W_t$ also belongs to $\Omega_t$. It is quite clearly impossible to
use all such deterministic functions as actual instrumental variables, and so
the econometrician must make a choice. What we have established so far is
that, once the choice of $W$ is made, (9.08) gives the optimal set of linear
combinations of the columns of $W$ to use for estimation. What remains to be
seen is how best to choose $W$ out of all the possible valid instruments, given
the information sets $\Omega_t$.


In Section 8.3, we saw that, for the model (9.01) with $\Omega = \sigma^2 I$, the best
choice, by the criterion of the asymptotic covariance matrix, is the matrix $\bar X$
given in (8.18) by the defining condition that $E(X_t \mid \Omega_t) = \bar X_t$, where $X_t$ and
$\bar X_t$ are the $t^{\rm th}$ rows of $X$ and $\bar X$, respectively. However, it is easy to see that
this result does not hold unmodified when $\Omega$ is not proportional to an identity
matrix. Consider the GMM estimator (9.10), of which (9.15) is the feasible
version, in the special case of exogenous explanatory variables, for which the
obvious choice of instruments is $W = X$. If, for notational ease, we write $\Omega$
for the true covariance matrix $\Omega_0$, (9.10) becomes


$$\begin{aligned}
\hat\beta_{\rm GMM} &= (X^\top X (X^\top\Omega X)^{-1} X^\top X)^{-1} X^\top X (X^\top\Omega X)^{-1} X^\top y \\
&= (X^\top X)^{-1} X^\top\Omega X (X^\top X)^{-1} X^\top X (X^\top\Omega X)^{-1} X^\top y \\
&= (X^\top X)^{-1} X^\top\Omega X (X^\top\Omega X)^{-1} X^\top y \\
&= (X^\top X)^{-1} X^\top y = \hat\beta_{\rm OLS}.
\end{aligned}$$


However, we know from the results of Section 7.2 that the efficient estimator
is actually the GLS estimator


$$\hat\beta_{\rm GLS} = (X^\top\Omega^{-1}X)^{-1} X^\top\Omega^{-1} y, \tag{9.17}$$


which, except in special cases, is different from $\hat\beta_{\rm OLS}$.
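

The algebra above can be confirmed numerically. In the sketch below, which uses simulated data with a hypothetical known diagonal $\Omega$, the GMM formula (9.10) with $W = X$ collapses to OLS, while the GLS estimator (9.17) gives a different answer.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
omega_diag = rng.uniform(0.5, 5.0, size=n)            # hypothetical known variances
Omega = np.diag(omega_diag)
y = X @ np.array([1.0, 2.0, -1.0]) + np.sqrt(omega_diag) * rng.normal(size=n)

def gmm_known_omega(y, X, W, Omega):
    """Estimator (9.10) for a known covariance matrix Omega."""
    S = W.T @ Omega @ W
    XW = X.T @ W
    return np.linalg.solve(XW @ np.linalg.solve(S, XW.T),
                           XW @ np.linalg.solve(S, W.T @ y))

b_gmm = gmm_known_omega(y, X, X, Omega)               # instruments W = X
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
Oinv = np.diag(1.0 / omega_diag)
b_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)   # GLS estimator (9.17)
```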


Copyright © 1999, Russell Davidson and James G. MacKinnon 




The GLS estimator (9.17) can be interpreted as an IV estimator, in which
the instruments are the columns of $\Omega^{-1}X$. Thus it appears that, when $\Omega$ is
not a multiple of the identity matrix, the optimal instruments are no longer
the explanatory variables $X$, but rather the columns of $\Omega^{-1}X$. This suggests
that, when at least some of the explanatory variables in the matrix $X$ are
not predetermined, the optimal choice of instruments is given by $\Omega^{-1}\bar X$. This
choice combines the result of Chapter 7 about the optimality of the GLS estimator
with that of Chapter 8 about the best instruments to use in place of
explanatory variables that are not predetermined. It leads to the theoretical
moment conditions


$$E(\bar X^\top\Omega^{-1}(y - X\beta)) = 0. \tag{9.18}$$


Unfortunately, this solution to the optimal instruments problem does not
always work, because the moment conditions in (9.18) may not be correct. To
see why not, suppose that the error terms are serially correlated, and that $\Omega$
is consequently not a diagonal matrix. The $i^{\rm th}$ element of the matrix product
in (9.18) can be expanded as


$$\sum_{t=1}^{n}\sum_{s=1}^{n} \bar X_{ti}\,\omega^{ts}(y_s - X_s\beta), \tag{9.19}$$


where $\omega^{ts}$ is the $ts^{\rm th}$ element of $\Omega^{-1}$. If we evaluate at the true parameter
vector $\beta_0$, we find that $y_s - X_s\beta_0 = u_s$. But, unless the columns of the
matrix $\bar X$ are exogenous, it is not in general the case that $E(u_s \mid \bar X_t) = 0$ for
$s \neq t$, and, if this condition is not satisfied, the expectation of (9.19) is not
zero in general. This issue was discussed at the end of Section 7.3, and in
more detail in Section 7.8, in connection with the use of GLS when one of the
explanatory variables is a lagged dependent variable.


Choosing Valid Instruments 


As in Section 7.2, we can construct an $n \times n$ matrix $\Psi$, which will usually be
triangular, that satisfies the equation $\Omega^{-1} = \Psi\Psi^\top$. As in equation (7.03) of
Section 7.2, we can premultiply regression (9.01) by $\Psi^\top$ to get


$$\Psi^\top y = \Psi^\top X\beta + \Psi^\top u, \tag{9.20}$$


with the result that the covariance matrix of the transformed error vector,
$\Psi^\top u$, is just the identity matrix. Suppose that we propose to use a matrix $Z$
of instruments in order to estimate the transformed model, so that we are led
to consider the theoretical moment conditions


$$E(Z^\top\Psi^\top(y - X\beta)) = 0. \tag{9.21}$$


If these conditions are to be correct, then what we need is that, for each $t$,
$E((\Psi^\top u)_t \mid Z_t) = 0$, where the subscript $t$ is used to select the $t^{\rm th}$ row of the
corresponding vector or matrix.


If $X$ is exogenous, the optimal instruments are given by the matrix $\Omega^{-1}X$, and
the moment conditions for efficient estimation are $E(X^\top\Omega^{-1}(y - X\beta)) = 0$,
which can also be written as


$$E(X^\top\Psi\Psi^\top(y - X\beta)) = 0. \tag{9.22}$$


Comparison with (9.21) shows that the optimal choice of $Z$ is $\Psi^\top X$. Even if
$X$ is not exogenous, (9.22) is a correct set of moment conditions if


$$E((\Psi^\top u)_t \mid (\Psi^\top X)_t) = 0. \tag{9.23}$$


But this is not true in general when $X$ is not exogenous. Consequently, we
seek a new definition for $\bar X$, such that (9.23) becomes true when $X$ is replaced
by $\bar X$.

In most cases, it is possible to choose $\Psi$ so that $(\Psi^\top u)_t$ is an innovation in
the sense of Section 4.5, that is, so that $E((\Psi^\top u)_t \mid \Omega_t) = 0$. As an example,
see the analysis of models with AR(1) errors in Section 7.8, especially the
discussion surrounding (7.57). What is then required for condition (9.23) is
that $(\Psi^\top\bar X)_t$ should be predetermined in period $t$. If $\Omega$ is diagonal, and so
also $\Psi$, the old definition of $\bar X$ will work, because $(\Psi^\top\bar X)_t = \Psi_{tt}\bar X_t$, where $\Psi_{tt}$
is the $t^{\rm th}$ diagonal element of $\Psi$, and this belongs to $\Omega_t$ by construction. If
$\Omega$ contains off-diagonal elements, however, the old definition of $\bar X$ no longer
works in general. Since what we need is that $(\Psi^\top\bar X)_t$ should belong to $\Omega_t$, we
instead define $\bar X$ implicitly by the equation


$$E((\Psi^\top X)_t \mid \Omega_t) = (\Psi^\top\bar X)_t. \tag{9.24}$$


This implicit definition must be implemented on a case-by-case basis. One
example is given in Exercise 9.5.


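By setting $Z = \Psi^\top\bar X$, we find that the moment conditions (9.21) become


$$E(\bar X^\top\Psi\Psi^\top(y - X\beta)) = E(\bar X^\top\Omega^{-1}(y - X\beta)) = 0. \tag{9.25}$$


These conditions do indeed use $\Omega^{-1}\bar X$ as instruments, albeit with a possibly
redefined $\bar X$. The estimator based on (9.25) is


$$\hat\beta_{\rm EGMM} = (\bar X^\top\Omega^{-1}X)^{-1}\bar X^\top\Omega^{-1}y, \tag{9.26}$$


where EGMM denotes “efficient GMM.” The asymptotic covariance matrix
of (9.26) can be computed using (9.09), in which, on the basis of (9.25), we
see that $W$ is to be replaced by $\Psi^\top\bar X$, $X$ by $\Psi^\top X$, and $\Omega$ by $I$. We cannot
apply (9.09) directly with instruments $\Omega^{-1}\bar X$, because there is no reason to
suppose that the result (9.02) holds for the untransformed error terms $u$ and
the instruments $\Omega^{-1}\bar X$. The result is


$$\operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}X^\top\Omega^{-1}\bar X\,\Bigl(\frac{1}{n}\bar X^\top\Omega^{-1}\bar X\Bigr)^{-1}\frac{1}{n}\bar X^\top\Omega^{-1}X\Bigr)^{-1}. \tag{9.27}$$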
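

By exactly the same argument as that used in (8.20), we find that, for any
matrix $Z$ that satisfies $Z_t \in \Omega_t$,


$$\operatorname*{plim}_{n\to\infty}\frac{1}{n}Z^\top\Psi^\top X = \operatorname*{plim}_{n\to\infty}\frac{1}{n}Z^\top\Psi^\top\bar X. \tag{9.28}$$


Since $(\Psi^\top\bar X)_t \in \Omega_t$, this implies that


$$\begin{aligned}
\operatorname*{plim}_{n\to\infty}\frac{1}{n}\bar X^\top\Omega^{-1}X &= \operatorname*{plim}_{n\to\infty}\frac{1}{n}\bar X^\top\Psi\Psi^\top X \\
&= \operatorname*{plim}_{n\to\infty}\frac{1}{n}\bar X^\top\Psi\Psi^\top\bar X = \operatorname*{plim}_{n\to\infty}\frac{1}{n}\bar X^\top\Omega^{-1}\bar X.
\end{aligned}$$


Therefore, the asymptotic covariance matrix (9.27) simplifies to


$$\operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}\bar X^\top\Omega^{-1}\bar X\Bigr)^{-1}. \tag{9.29}$$


Although the matrix (9.09) is less of a sandwich than (9.07), the matrix (9.29)
is still less of one than (9.09). This is a clear indication of the fact that the
instruments $\Omega^{-1}\bar X$, which yield the estimator $\hat\beta_{\rm EGMM}$, are indeed optimal.
Readers are asked to check this formally in Exercise 9.7.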


In most cases, $\bar X$ is not observed, but it can often be estimated consistently.
The usual state of affairs is that we have an $n \times l$ matrix $W$ of instruments,
such that $\mathcal{S}(\bar X) \subseteq \mathcal{S}(W)$ and


$$(\Psi^\top W)_t \in \Omega_t. \tag{9.30}$$


This last condition is the form taken by the predeterminedness condition
when $\Omega$ is not proportional to the identity matrix. The theoretical moment
conditions used for (overidentified) estimation are then


$$E(W^\top\Omega^{-1}(y - X\beta)) = E(W^\top\Psi\Psi^\top(y - X\beta)) = 0, \tag{9.31}$$


from which it can be seen that what we are in fact doing is estimating the
transformed model (9.20) using the transformed instruments $\Psi^\top W$. The result
of Exercise 9.8 shows that, if indeed $\mathcal{S}(\bar X) \subseteq \mathcal{S}(W)$, the asymptotic covariance
matrix of the resulting estimator is still (9.29). Exercise 9.9 investigates
what happens if this condition is not satisfied.


The main obstacle to the use of the efficient estimator $\hat\beta_{\rm EGMM}$ is thus not the
difficulty of estimating $\bar X$, but rather the fact that $\Omega$ is usually not known.
As with the GLS estimators we studied in Chapter 7, $\hat\beta_{\rm EGMM}$ cannot be
calculated unless we either know $\Omega$ or can estimate it consistently, usually
by knowing the form of $\Omega$ as a function of parameters that can be estimated
consistently. But whenever there is heteroskedasticity or serial correlation of
unknown form, this is impossible. The best we can then do, asymptotically,
is to use the feasible efficient GMM estimator (9.15). Therefore, when we
later refer to GMM estimators without further qualification, we will normally
mean feasible efficient ones.


9.3 HAC Covariance Matrix Estimation


Up to this point, we have seen how to obtain feasible efficient GMM estimates
only when the matrix $\Omega$ is known to be diagonal, in which case we can use
the estimator (9.15). In this section, we also allow for the possibility of serial
correlation of unknown form, which causes $\Omega$ to have nonzero off-diagonal
elements. When the pattern of the serial correlation is unknown, we can still,
under fairly weak regularity conditions, estimate the covariance matrix of the
sample moments by using a heteroskedasticity and autocorrelation consistent,
or HAC, estimator of the matrix $n^{-1}W^\top\Omega W$. This estimator, multiplied
by $n$, can then be used in place of $W^\top\hat\Omega W$ in the feasible efficient GMM
estimator (9.15).


The asymptotic covariance matrix of the vector $n^{-1/2}W^\top(y - X\beta)$ of sample
moments, evaluated at $\beta = \beta_0$, is defined as follows:


$$\Sigma \equiv \operatorname*{plim}_{n\to\infty}\frac{1}{n}W^\top(y - X\beta_0)(y - X\beta_0)^\top W = \operatorname*{plim}_{n\to\infty}\frac{1}{n}W^\top\Omega W. \tag{9.32}$$


A HAC estimator of $\Sigma$ is a matrix $\hat\Sigma$ constructed so that $\hat\Sigma$ consistently
estimates $\Sigma$ when the error terms $u_t$ display any pattern of heteroskedasticity
and/or autocorrelation that satisfies certain, generally quite weak, conditions.
In order to derive such an estimator, we begin by rewriting the definition of
$\Sigma$ in an alternative way:


$$\Sigma = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\sum_{s=1}^{n} E(u_t u_s W_t^\top W_s), \tag{9.33}$$


in which we assume that a law of large numbers can be used to justify replacing
the probability limit in (9.32) by the expectations in (9.33).


For regression models with heteroskedasticity but no autocorrelation, only
the terms with $t = s$ contribute to (9.33). Therefore, for such models, we
can estimate $\Sigma$ consistently by simply ignoring the expectation operator and
replacing the error terms $u_t$ by least squares residuals $\hat u_t$, possibly with a
modification designed to offset the tendency for such residuals to be too small. The
obvious way to estimate (9.33) when there may be serial correlation is again
simply to drop the expectations operator and replace $u_t u_s$ by $\hat u_t\hat u_s$, where $\hat u_t$
denotes the $t^{\rm th}$ residual from some consistent but inefficient estimation
procedure, such as generalized IV. Unfortunately, this approach will not work. To
see why not, we need to rewrite (9.33) in yet another way. Let us define the
autocovariance matrices of the $W_t^\top u_t$ as follows:


$$\Gamma(j) \equiv \begin{cases}
\dfrac{1}{n}\displaystyle\sum_{t=j+1}^{n} E(u_t u_{t-j} W_t^\top W_{t-j}) & \text{for } j \ge 0, \\[2ex]
\dfrac{1}{n}\displaystyle\sum_{t=-j+1}^{n} E(u_{t+j} u_t W_{t+j}^\top W_t) & \text{for } j < 0.
\end{cases} \tag{9.34}$$


Because there are $l$ moment conditions, these are $l \times l$ matrices. It is easy to
check that $\Gamma(j) = \Gamma^\top(-j)$. Then, in terms of the matrices $\Gamma(j)$, expression
(9.33) becomes


$$\Sigma = \lim_{n\to\infty}\sum_{j=-n+1}^{n-1}\Gamma(j) = \lim_{n\to\infty}\Bigl(\Gamma(0) + \sum_{j=1}^{n-1}\bigl(\Gamma(j) + \Gamma^\top(j)\bigr)\Bigr). \tag{9.35}$$


Therefore, in order to estimate $\Sigma$, we apparently need to estimate all of the
autocovariance matrices for $j = 0, \ldots, n-1$.

If $\hat u_t$ denotes a typical residual from some preliminary estimator, the sample
autocovariance matrix of order $j$, $\hat\Gamma(j)$, is just the appropriate expression in
(9.34), without the expectation operator, and with the random variables $u_t$
and $u_{t-j}$ replaced by $\hat u_t$ and $\hat u_{t-j}$, respectively. For any $j \ge 0$, this is


$$\hat\Gamma(j) = \frac{1}{n}\sum_{t=j+1}^{n}\hat u_t\hat u_{t-j} W_t^\top W_{t-j}. \tag{9.36}$$


Unfortunately, the sample autocovariance matrix $\hat\Gamma(j)$ of order $j$ is not a
consistent estimator of the true autocovariance matrix for arbitrary $j$. Suppose,
for instance, that $j = n-2$. Then, from (9.36), we see that $\hat\Gamma(j)$ has only two
terms, and no conceivable law of large numbers can apply to only two terms.
In fact, $\hat\Gamma(n-2)$ must tend to zero as $n \to \infty$ because of the factor of $n^{-1}$ in
its definition.
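

The sample autocovariance matrix (9.36) is easy to compute once the rows $\hat u_t W_t$ are stacked in a matrix. The sketch below uses random numbers as stand-ins for residuals and instruments (an assumption made purely for illustration), and it shows explicitly that $\hat\Gamma(n-2)$ is built from just two cross products.

```python
import numpy as np

def gamma_hat(uhat, W, j):
    """Sample autocovariance matrix (9.36) of order j >= 0."""
    n = len(uhat)
    V = W * uhat[:, None]              # row t holds uhat_t * W_t
    # (1/n) * sum over t = j+1, ..., n of uhat_t uhat_{t-j} W_t' W_{t-j}
    return V[j:].T @ V[: n - j] / n

rng = np.random.default_rng(5)
n, l = 50, 3
W = rng.normal(size=(n, l))
uhat = rng.normal(size=n)              # stand-in for preliminary residuals

G0 = gamma_hat(uhat, W, 0)             # order 0: (1/n) W' diag(uhat^2) W
G_far = gamma_hat(uhat, W, n - 2)      # order n-2: only two terms in the sum
```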


The solution to this problem is to restrict our attention to models for which
the actual autocovariances mimic the behavior of the sample autocovariances,
and for which therefore the actual autocovariance of order $j$ tends to zero as
$j \to \infty$. A great many stochastic processes generate error terms for which
the $\Gamma(j)$ do have this property. In such cases, we can drop most of the
sample autocovariance matrices that appear in the sample analog of (9.35) by
eliminating ones for which $|j|$ is greater than some chosen threshold, say $p$.
This yields the following estimator for $\Sigma$:


$$\hat\Sigma_{\rm HW} = \hat\Gamma(0) + \sum_{j=1}^{p}\bigl(\hat\Gamma(j) + \hat\Gamma^\top(j)\bigr). \tag{9.37}$$


We refer to (9.37) as the Hansen-White estimator, because it was originally
proposed by Hansen (1982) and White and Domowitz (1984); see also White
(1984).


For the purposes of asymptotic theory, it is necessary to let the parameter $p$,
which is called the lag truncation parameter, go to infinity in (9.37) at some
suitable rate as the sample size goes to infinity. A typical rate would be $n^{1/4}$.
This ensures that, for large enough $n$, all the nonzero $\Gamma(j)$ are estimated
consistently. Unfortunately, this type of result does not say how large $p$ should
be in practice. In most cases, we have a given, finite, sample size, and we need
to choose a specific value of $p$.


The Hansen-White estimator (9.37) suffers from one very serious deficiency: In
finite samples, it need not be positive definite or even positive semidefinite. If
one happens to encounter a data set that yields a nondefinite $\hat\Sigma_{\rm HW}$, then, since
the weighting matrix for GMM must be positive definite, (9.37) is unusable.
Luckily, there are numerous ways out of this difficulty. The one that is most
widely used was suggested by Newey and West (1987). The estimator they
propose is


$$\hat\Sigma_{\rm NW} = \hat\Gamma(0) + \sum_{j=1}^{p}\Bigl(1 - \frac{j}{p+1}\Bigr)\bigl(\hat\Gamma(j) + \hat\Gamma^\top(j)\bigr), \tag{9.38}$$


in which each sample autocovariance matrix $\hat\Gamma(j)$ is multiplied by a weight
$1 - j/(p+1)$ that decreases linearly as $j$ increases. The weight is $p/(p+1)$
for $j = 1$, and it then decreases by steps of $1/(p+1)$ down to a value of
$1/(p+1)$ for $j = p$. This estimator will evidently tend to underestimate the
autocovariance matrices, especially for larger values of $j$. Therefore, $p$ should
almost certainly be larger for (9.38) than for (9.37). As with the Hansen-White
estimator, $p$ must increase as $n$ does, and the appropriate rate is $n^{1/3}$.
A procedure for selecting $p$ automatically was proposed by Newey and West
(1994), but it is too complicated to discuss here.
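

The contrast between (9.37) and (9.38) can be seen in a deliberately extreme toy example, sketched below. The single constant instrument and the residuals that alternate in sign are assumptions chosen to make the point; the function names are ours.

```python
import numpy as np

def gamma_hat(uhat, W, j):
    """Sample autocovariance matrix (9.36) of order j >= 0."""
    n = len(uhat)
    V = W * uhat[:, None]
    return V[j:].T @ V[: n - j] / n

def hac(uhat, W, p, newey_west=True):
    """Hansen-White (9.37) or Newey-West (9.38) estimate with lag truncation p."""
    S = gamma_hat(uhat, W, 0)
    for j in range(1, p + 1):
        weight = 1.0 - j / (p + 1.0) if newey_west else 1.0
        G = gamma_hat(uhat, W, j)
        S = S + weight * (G + G.T)
    return S

# One instrument (a constant) and residuals alternating between +1 and -1.
n = 10
W = np.ones((n, 1))
uhat = np.array([1.0, -1.0] * (n // 2))

S_hw = hac(uhat, W, p=1, newey_west=False)   # 1 + 2 * (-0.9) = -0.8: not usable
S_nw = hac(uhat, W, p=1, newey_west=True)    # 1 + 0.5 * 2 * (-0.9) = 0.1: positive
```

Here the Hansen-White estimate is negative, so it cannot serve as a weighting matrix, whereas the Newey-West weights keep the estimate positive.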


Both the Hansen-White and the Newey-West HAC estimators of $\Sigma$ can be
written in the form


$$\hat\Sigma = \frac{1}{n}W^\top\hat\Omega W \tag{9.39}$$


for an appropriate choice of $\hat\Omega$. This fact, which we will exploit in the next
section, follows from the observation that there exist $n \times n$ matrices $U(j)$ such
that the $\hat\Gamma(j)$ can be expressed in the form $n^{-1}W^\top U(j)W$, as readers are
asked to check in Exercise 9.10.
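

One way to see that (9.39) is possible is to build a candidate $\hat\Omega$ for the Newey-West case directly. The sketch below assumes that the $ts^{\rm th}$ element of $\hat\Omega$ is $\hat u_t\hat u_s$ multiplied by the weight $1 - |t-s|/(p+1)$ when $|t-s| \le p$, and zero otherwise; this construction and the function names are ours, offered as one possibility rather than as the solution intended by the exercise.

```python
import numpy as np

def newey_west_omega(uhat, p):
    """n x n matrix Omega-hat such that (9.38) equals W' Omega-hat W / n."""
    n = len(uhat)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))   # |t - s|
    weights = np.clip(1.0 - lags / (p + 1.0), 0.0, None)           # zero beyond lag p
    return weights * np.outer(uhat, uhat)

def newey_west_direct(uhat, W, p):
    """Newey-West estimator (9.38) from the sample autocovariance matrices."""
    n = len(uhat)
    V = W * uhat[:, None]
    S = V.T @ V / n
    for j in range(1, p + 1):
        G = V[j:].T @ V[: n - j] / n
        S = S + (1.0 - j / (p + 1.0)) * (G + G.T)
    return S

rng = np.random.default_rng(6)
n, l, p = 30, 3, 4
W = rng.normal(size=(n, l))
uhat = rng.normal(size=n)                  # stand-in for residuals

Sigma_from_omega = W.T @ newey_west_omega(uhat, p) @ W / n
Sigma_direct = newey_west_direct(uhat, W, p)
```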


The Newey-West estimator is by no means the only HAC estimator that is 
guaranteed to be positive definite. Andrews (1991) provides a detailed treat- 
ment of HAC estimation, suggests some alternatives to the Newey-West esti- 
mator, and shows that, in some circumstances, they may perform better than 
it does in finite samples. A different approach to HAC estimation is suggested 
by Andrews and Monahan (1992). Since this material is relatively advanced 
and specialized, we will not pursue it further here. Interested readers may 
wish to consult Hamilton (1994, Chapter 10) as well as the references already 
given. 


Feasible Efficient GMM Estimation 


In practice, efficient GMM estimation in the presence of heteroskedasticity and
serial correlation of unknown form works as follows. As in the case with only
heteroskedasticity that was discussed in Section 9.2, we first obtain consistent
but inefficient estimates, probably by using generalized IV. These estimates
yield residuals $\hat u_t$, from which we next calculate a matrix $\hat\Sigma$ that estimates $\Sigma$
consistently, using (9.37), (9.38), or some other HAC estimator. The feasible
efficient GMM estimator, which generalizes (9.15), is then


$$\hat\beta_{\rm FGMM} = (X^\top W\hat\Sigma^{-1}W^\top X)^{-1}X^\top W\hat\Sigma^{-1}W^\top y. \tag{9.40}$$


As before, this procedure may be iterated. The first-round GMM residuals
may be used to obtain a new estimate of $\Sigma$, which may be used to obtain
second-round GMM estimates, and so on. For a correctly specified model,
iteration should not affect the asymptotic properties of the estimates.


We can estimate the covariance matrix of (9.40) by


$$\widehat{\rm Var}(\hat\beta_{\rm FGMM}) = n(X^\top W\hat\Sigma^{-1}W^\top X)^{-1}, \tag{9.41}$$


which is the analog of (9.16). The factor of $n$ here is needed to offset the
factor of $n^{-1}$ in the definition of $\hat\Sigma$. We do not need to include such a factor
in (9.40), because the two factors of $n^{-1}$ cancel out. As usual, the covariance
matrix estimator (9.41) can be used to construct pseudo-$t$ tests and other
Wald tests, and asymptotic confidence intervals and confidence regions may
also be based on it. The GMM criterion function that corresponds to (9.40) is


$$\frac{1}{n}(y - X\beta)^\top W\hat\Sigma^{-1}W^\top(y - X\beta). \tag{9.42}$$


Once again, we need a factor of $n^{-1}$ here to offset the one in $\hat\Sigma$.
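

The steps just described can be sketched as follows, using a Newey-West estimate of $\Sigma$. The simulated data with AR(1) errors, the lag truncation parameter $p = 4$, and the function names are assumptions made for illustration only.

```python
import numpy as np

def newey_west(uhat, W, p):
    """Newey-West estimate (9.38) of Sigma."""
    n = len(uhat)
    V = W * uhat[:, None]
    S = V.T @ V / n
    for j in range(1, p + 1):
        G = V[j:].T @ V[: n - j] / n
        S = S + (1.0 - j / (p + 1.0)) * (G + G.T)
    return S

def fgmm(y, X, W, Sigma_hat):
    """Estimator (9.40), covariance estimate (9.41), criterion value (9.42)."""
    n = len(y)
    XW = X.T @ W
    V = n * np.linalg.inv(XW @ np.linalg.solve(Sigma_hat, XW.T))    # (9.41)
    b = (V / n) @ (XW @ np.linalg.solve(Sigma_hat, W.T @ y))        # (9.40)
    g = W.T @ (y - X @ b)
    crit = g @ np.linalg.solve(Sigma_hat, g) / n                    # (9.42) at b
    return b, V, crit

# Simulated data (hypothetical): endogenous regressor, serially correlated errors.
rng = np.random.default_rng(7)
n = 400
Z = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), Z])
v = rng.normal(size=n)
eps = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + eps[t]                    # AR(1) component of the error
X = np.column_stack([np.ones(n), Z @ np.array([1.0, 0.5]) + v])
y = X @ np.array([1.0, 2.0]) + 0.5 * v + e

# Step 1: generalized IV residuals.  Step 2: HAC estimate.  Step 3: FGMM.
PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
b0 = np.linalg.solve(PWX.T @ X, PWX.T @ y)
Sigma_hat = newey_west(y - X @ b0, W, p=4)
b_fgmm, V_fgmm, crit = fgmm(y, X, W, Sigma_hat)
```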


The feasible efficient GMM estimator (9.40) can be used even when all the
columns of $X$ are valid instruments and OLS would be the estimator of choice
if the error terms were not heteroskedastic and/or serially correlated. In this
case, $W$ typically consists of $X$ augmented by a number of functions of the
columns of $X$, such as squares and cross-products, and $\hat\Omega$ has squared OLS
residuals on the diagonal. This estimator, which was proposed by Cragg
(1983) for models with heteroskedastic error terms, will be asymptotically
more efficient than OLS whenever $\Omega$ is not proportional to an identity matrix.


9.4 Tests Based on the GMM Criterion Function 


For models estimated by instrumental variables, we saw in Section 8.5 that 
any set of r equality restrictions can be tested by taking the difference between 
the minimized values of the IV criterion function for the restricted and unrestricted
models, and then dividing it by a consistent estimate of the error variance.
The resulting test statistic is asymptotically distributed as $\chi^2(r)$. For
models estimated by (feasible) efficient GMM, a very similar testing procedure
is available. In this case, as we will see, the difference between the constrained
and unconstrained minima of the GMM criterion function is asymptotically
distributed as $\chi^2(r)$. There is no need to divide by an estimate of $\sigma^2$, because
the GMM criterion function already takes account of the covariance matrix
of the error terms.


Tests of Overidentifying Restrictions 


Whenever $l > k$, a model estimated by GMM involves $l - k$ overidentifying
restrictions. As in the IV case, tests of these restrictions are even easier
to perform than tests of other restrictions, because the minimized value of
the optimal GMM criterion function (9.11), with $n^{-1}W^\top\Omega_0 W$ replaced by
a HAC estimate, provides an asymptotically valid test statistic. When the
HAC estimate $\hat\Sigma$ is expressed as in (9.39), the GMM criterion function (9.42)
can be written as


$$Q(\beta, y) \equiv (y - X\beta)^\top W(W^\top\hat\Omega W)^{-1}W^\top(y - X\beta). \tag{9.43}$$


Since HAC estimators are consistent, the asymptotic distribution of (9.43),
for given $\beta$, is the same whether we use the unknown true $\Omega$ or a matrix $\hat\Omega$
that provides a HAC estimate. For simplicity, we therefore use the true $\Omega_0$,
omitting the subscript 0 for ease of notation. The asymptotic equivalence of
the $\hat\beta_{\rm FGMM}$ of (9.15) or (9.40) and the $\hat\beta_{\rm GMM}$ of (9.10) further implies that
what we will prove for the criterion function (9.43) evaluated at $\hat\beta_{\rm GMM}$, with
$\hat\Omega$ replaced by $\Omega$, will equally be true for (9.43) evaluated at $\hat\beta_{\rm FGMM}$.


We remarked in Section 9.2 that $Q(\beta_0, y)$, where $\beta_0$ is the true parameter
vector, is asymptotically distributed as $\chi^2(l)$. In contrast, the minimized
criterion function $Q(\hat\beta_{\rm GMM}, y)$ is distributed as $\chi^2(l - k)$, because we lose
$k$ degrees of freedom as a consequence of having estimated $k$ parameters.
In order to demonstrate this result, we first express (9.43) in terms of an
orthogonal projection matrix. This allows us to reuse many of the calculations
performed in Chapter 8.


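As in Section 9.2, we make use of a possibly triangular matrix $\Psi$ that satisfies
the equation $\Omega^{-1} = \Psi\Psi^\top$, or, equivalently,


$$\Omega = (\Psi^\top)^{-1}\Psi^{-1}. \tag{9.44}$$


If the $n \times l$ matrix $A$ is defined as $\Psi^{-1}W$, and $P_A \equiv A(A^\top A)^{-1}A^\top$, then


$$\begin{aligned}
Q(\beta, y) &= (y - X\beta)^\top\Psi\Bigl(\Psi^{-1}W\bigl(W^\top(\Psi^\top)^{-1}\Psi^{-1}W\bigr)^{-1}W^\top(\Psi^\top)^{-1}\Bigr)\Psi^\top(y - X\beta) \\
&= (y - X\beta)^\top\Psi P_A\Psi^\top(y - X\beta).
\end{aligned} \tag{9.45}$$


Since $\hat\beta_{\rm GMM}$ minimizes (9.45), we see that one way to write it is


$$\hat\beta_{\rm GMM} = (X^\top\Psi P_A\Psi^\top X)^{-1}X^\top\Psi P_A\Psi^\top y; \tag{9.46}$$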


compare (9.10). Expression (9.46) makes it clear that $\hat\beta_{\rm GMM}$ can be thought
of as a GIV estimator for the regression of $\Psi^\top y$ on $\Psi^\top X$ using instruments
$A \equiv \Psi^{-1}W$. As in (8.61), it can be shown that


$$P_A\Psi^\top(y - X\hat\beta_{\rm GMM}) = P_A(I - P_{P_A\Psi^\top X})\Psi^\top y,$$


where $P_{P_A\Psi^\top X}$ is the orthogonal projection on to the subspace $\mathcal{S}(P_A\Psi^\top X)$.
It follows that


$$Q(\hat\beta_{\rm GMM}, y) = y^\top\Psi(P_A - P_{P_A\Psi^\top X})\Psi^\top y, \tag{9.47}$$


which is the analog for GMM estimation of expression (8.61) for generalized
IV estimation.


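Now notice that


$$\begin{aligned}
(P_A - P_{P_A\Psi^\top X})\Psi^\top X &= P_A\Psi^\top X - P_A\Psi^\top X(X^\top\Psi P_A\Psi^\top X)^{-1}X^\top\Psi P_A\Psi^\top X \\
&= P_A\Psi^\top X - P_A\Psi^\top X = O.
\end{aligned}$$


Since $y = X\beta_0 + u$ if the model we are estimating is correctly specified, this
implies that (9.47) is equal to


$$Q(\hat\beta_{\rm GMM}, y) = u^\top\Psi(P_A - P_{P_A\Psi^\top X})\Psi^\top u. \tag{9.48}$$


This expression can be compared with the value of the criterion function
evaluated at $\beta_0$, which can be obtained directly from (9.45):


$$Q(\beta_0, y) = u^\top\Psi P_A\Psi^\top u. \tag{9.49}$$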


The two expressions (9.48) and (9.49) show clearly where the $k$ degrees of
freedom are lost when we estimate $\beta$. We know that $E(\Psi^\top u) = 0$ and that
$E(\Psi^\top uu^\top\Psi) = \Psi^\top\Omega\Psi = I$, by (9.44). The dimension of the space $\mathcal{S}(A)$ is
equal to $l$. Therefore, the extension of Theorem 4.1 treated in Exercise 9.2
allows us to conclude that (9.49) is asymptotically distributed as $\chi^2(l)$. Since
$\mathcal{S}(P_A\Psi^\top X)$ is a $k$-dimensional subspace of $\mathcal{S}(A)$, it follows (see Exercise 2.16)
that $P_A - P_{P_A\Psi^\top X}$ is an orthogonal projection on to a space of dimension
$l - k$, from which we see that (9.48) is asymptotically distributed as $\chi^2(l - k)$.
Replacing $\beta_0$ by $\hat\beta_{\rm GMM}$ in (9.49) thus leads to the loss of the $k$ dimensions of
the space $\mathcal{S}(P_A\Psi^\top X)$, which are “used up” when we obtain $\hat\beta_{\rm GMM}$.
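

The projection results (9.45) through (9.47) can be checked numerically for a known $\Omega$. In the sketch below, the positive definite $\Omega$, the simulated data, and the helper names are assumptions; $\Psi$ is obtained as the Cholesky factor of $\Omega^{-1}$, which is one convenient triangular choice.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, l = 60, 2, 4
W = np.column_stack([np.ones(n), rng.normal(size=(n, l - 1))])
x = W[:, 1:] @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
M = rng.normal(size=(n, n))
Omega = M @ M.T / n + np.eye(n)                   # hypothetical positive definite Omega
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Omega) @ rng.normal(size=n)

Psi = np.linalg.cholesky(np.linalg.inv(Omega))    # lower triangular, Psi Psi' = Omega^{-1}
A = np.linalg.solve(Psi, W)                       # A = Psi^{-1} W

def proj(B):
    """Orthogonal projection on to the column space of B."""
    return B @ np.linalg.solve(B.T @ B, B.T)

P_A = proj(A)
P_AX = proj(P_A @ Psi.T @ X)

def Q(b):
    """Criterion function (9.43), here with the true Omega."""
    g = W.T @ (y - X @ b)
    return g @ np.linalg.solve(W.T @ Omega @ W, g)

XW = X.T @ W
S = W.T @ Omega @ W
b_gmm = np.linalg.solve(XW @ np.linalg.solve(S, XW.T),
                        XW @ np.linalg.solve(S, W.T @ y))   # estimator (9.10)

Q_min = Q(b_gmm)
Q_proj = y @ Psi @ (P_A - P_AX) @ Psi.T @ y       # right-hand side of (9.47)
dim_diff = np.trace(P_A - P_AX)                   # trace of a projection = its dimension
```

The trace `dim_diff` equals $l - k$ up to rounding, which is the number of degrees of freedom of the limiting $\chi^2$ distribution.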


The statistic $Q(\hat\beta_{\rm GMM}, y)$ is the analog, for efficient GMM estimation, of the
Sargan test statistic that was discussed in Section 8.6. This statistic was
suggested by Hansen (1982) in the famous paper that first proposed GMM
estimation under that name. It is often called Hansen’s overidentification
statistic or Hansen’s J statistic. However, we prefer to call it the Hansen-Sargan
statistic to stress its close relationship with the Sargan test of overidentifying
restrictions in the context of generalized IV estimation.


As in the case of IV estimation, a Hansen-Sargan test may reject the null 
hypothesis for more than one reason. Perhaps the model is misspecified, either 
because one or more of the instruments should have been included among the 
regressors, or for some other reason. Perhaps one or more of the instruments 
is invalid because it is correlated with the error terms. Or perhaps the finite- 
sample distribution of the test statistic just happens to differ substantially 
from its asymptotic distribution. In the case of feasible GMM estimation, 
especially involving HAC covariance matrices, this last possibility should not 
be discounted. See, among others, Hansen, Heaton, and Yaron (1996) and 
West and Wilcox (1996). 
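

A minimal sketch of the Hansen-Sargan test follows, for the heteroskedasticity-only case with one overidentifying restriction. The simulated data and the function name `hansen_sargan` are assumptions; the value 3.841 is the familiar 5% critical value of the $\chi^2(1)$ distribution.

```python
import numpy as np

def hansen_sargan(y, X, W):
    """Two-step GMM with a heteroskedasticity-robust weighting matrix.
    Returns the estimate, the minimized criterion function, and W' Omega-hat W."""
    PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
    b0 = np.linalg.solve(PWX.T @ X, PWX.T @ y)         # preliminary generalized IV
    uhat = y - X @ b0
    S = (W * uhat[:, None] ** 2).T @ W                 # W' Omega-hat W
    XW = X.T @ W
    b = np.linalg.solve(XW @ np.linalg.solve(S, XW.T),
                        XW @ np.linalg.solve(S, W.T @ y))
    g = W.T @ (y - X @ b)
    return b, g @ np.linalg.solve(S, g), S

# Simulated data (hypothetical) with valid instruments: l = 3, k = 2, so l - k = 1.
rng = np.random.default_rng(9)
n = 500
Z = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), Z])
v = rng.normal(size=n)
X = np.column_stack([np.ones(n), Z @ np.array([1.0, 0.5]) + v])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + (1.0 + np.abs(Z[:, 0])) * (0.5 * v + rng.normal(size=n))

b_hat, J, S = hansen_sargan(y, X, W)
reject = J > 3.841            # compare with the 5% critical value of chi-squared(1)
```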


Tests of Linear Restrictions 


Just as in the case of generalized IV, both linear and nonlinear restrictions
on regression models can be tested by using the difference between the
constrained and unconstrained minima of the GMM criterion function as a test
statistic. Under weak conditions, this test statistic will be asymptotically
distributed as $\chi^2$ with as many degrees of freedom as there are restrictions to
be tested. For simplicity, we restrict our attention to zero restrictions on the
linear regression model (9.01). This model can be rewritten as


$$y = X_1\beta_1 + X_2\beta_2 + u, \qquad E(uu^\top) = \Omega, \tag{9.50}$$


where $\beta_1$ is a $k_1$-vector and $\beta_2$ is a $k_2$-vector, with $k = k_1 + k_2$. We wish to
test the restrictions $\beta_2 = 0$.


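If we estimate (9.50) by feasible efficient GMM using $W$ as the matrix of
instruments, subject to the restriction that $\beta_2 = 0$, we will obtain the restricted
estimates $\tilde\beta_{\rm FGMM} = [\tilde\beta_1 \;\vdots\; 0]$. By the reasoning that leads to (9.48), we see
that, if indeed $\beta_2 = 0$, the constrained minimum of the criterion function is


$$\begin{aligned}
Q(\tilde\beta_{\rm FGMM}, y) &= (y - X_1\tilde\beta_1)^\top W(W^\top\hat\Omega W)^{-1}W^\top(y - X_1\tilde\beta_1) \\
&= u^\top\Psi(P_A - P_{P_A\Psi^\top X_1})\Psi^\top u.
\end{aligned} \tag{9.51}$$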


If we subtract (9.48) from (9.51), we find that the difference between the
constrained and unconstrained minima of the criterion function is


$$Q(\tilde\beta_{\rm FGMM}, y) - Q(\hat\beta_{\rm FGMM}, y) = u^\top\Psi(P_{P_A\Psi^\top X} - P_{P_A\Psi^\top X_1})\Psi^\top u. \tag{9.52}$$


Since $\mathcal{S}(P_A\Psi^\top X_1) \subseteq \mathcal{S}(P_A\Psi^\top X)$, we see that $P_{P_A\Psi^\top X} - P_{P_A\Psi^\top X_1}$ is an
orthogonal projection matrix of which the image is of dimension $k - k_1 = k_2$.
Once again, the result of Exercise 9.2 shows that the test statistic (9.52) is
asymptotically distributed as $\chi^2(k_2)$ if the null hypothesis that $\beta_2 = 0$ is true.
This result continues to hold if the restrictions are nonlinear, as we will see
in Section 9.5.


The result that the statistic $Q(\tilde\beta_{\rm FGMM}, y) - Q(\hat\beta_{\rm FGMM}, y)$ is asymptotically
distributed as $\chi^2(k_2)$ depends on two critical features of the construction of
the statistic. The first is that the same matrix of instruments $W$ is used for
estimating both the restricted and unrestricted models. This was also required
in Section 8.5, when we discussed testing restrictions on linear regression
models estimated by generalized IV. The second essential feature is that the
same weighting matrix $(W^\top\hat\Omega W)^{-1}$ is used when estimating both models. If,
as is usually the case, this matrix has to be estimated, it is important that the
same estimate be used in both criterion functions. If different instruments or
different weighting matrices are used for the two models, (9.52) is no longer
in general asymptotically distributed as $\chi^2(k_2)$.


One interesting consequence of the form of (9.52) is that we do not always
need to bother estimating the unrestricted model. The test statistic (9.52)
must always be less than the constrained minimum $Q(\tilde\beta_{\rm FGMM}, y)$. Therefore,
if $Q(\tilde\beta_{\rm FGMM}, y)$ is less than the critical value for the $\chi^2(k_2)$ distribution at
our chosen significance level, we can be sure that the actual test statistic will
be even smaller and will not lead us to reject the null.
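

The test based on (9.52), and the shortcut just described, can be sketched as follows. The simulated data (in which the null hypothesis holds), the use of preliminary generalized IV residuals from the unrestricted model to form the common weighting matrix, and the function name are all assumptions made for this illustration.

```python
import numpy as np

def min_criterion(y, X, W, S):
    """Minimize (y - Xb)' W S^{-1} W' (y - Xb) over b for a fixed matrix S."""
    XW = X.T @ W
    b = np.linalg.solve(XW @ np.linalg.solve(S, XW.T),
                        XW @ np.linalg.solve(S, W.T @ y))
    g = W.T @ (y - X @ b)
    return b, g @ np.linalg.solve(S, g)

# Simulated data (hypothetical): x1 endogenous, x2 exogenous with true coefficient 0.
rng = np.random.default_rng(10)
n = 500
z = rng.normal(size=(n, 2))
x2 = rng.normal(size=n)
v = rng.normal(size=n)
x1 = z @ np.array([1.0, 0.5]) + v
X = np.column_stack([np.ones(n), x1, x2])         # k = 3
X1 = X[:, :2]                                     # restricted model drops x2 (k2 = 1)
W = np.column_stack([np.ones(n), z, x2])          # l = 4 instruments
u = (1.0 + 0.5 * np.abs(z[:, 0])) * (0.5 * v + rng.normal(size=n))
y = X @ np.array([1.0, 2.0, 0.0]) + u

# One weighting matrix, from generalized IV residuals, used for BOTH models.
PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
uhat = y - X @ np.linalg.solve(PWX.T @ X, PWX.T @ y)
S = (W * uhat[:, None] ** 2).T @ W

_, Q_restricted = min_criterion(y, X1, W, S)
_, Q_unrestricted = min_criterion(y, X, W, S)
stat = Q_restricted - Q_unrestricted              # the statistic (9.52)

crit_value = 3.841                                # 5% critical value of chi-squared(1)
shortcut_applies = Q_restricted < crit_value      # if True, the null cannot be rejected
```

Because the same matrix `S` enters both minimizations, `stat` cannot be negative and cannot exceed `Q_restricted`.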


The result that tests of restrictions may be based on the difference between 
the constrained and unconstrained minima of the GMM criterion function 
holds only for efficient GMM estimation. It is not true for nonoptimal crite- 
rion functions like (9.12), which do not use an estimate of the inverse of the 
covariance matrix of the sample moments as a weighting matrix. When the 
GMM estimates minimize a nonoptimal criterion function, the easiest way to 
test restrictions is probably to use a Wald test; see Sections 6.7 and 8.5. How- 
ever, we do not recommend performing inference on the basis of nonoptimal 
GMM estimation. 


9.5 GMM Estimators for Nonlinear Models 


The principles underlying GMM estimation of nonlinear models are the same 
as those we have developed for GMM estimation of linear regression models. 
For every result that we have discussed in the previous three sections, there is 
an analogous result for nonlinear models. In order to develop these results, we 
will take a somewhat more general and abstract approach than we have done 
up to this point. This approach, which is based on the theory of estimating 
functions, was originally developed by Godambe (1960); see also Godambe 
and Thompson (1978). 


The method of estimating functions employs the concept of an elementary 
zero function. Such a function plays the same role as a residual in the esti- 
mation of a regression model. It depends on observed variables, at least one 
of which must be endogenous, and on a $k$-vector of parameters, $\theta$. As with
a residual, the expectation of an elementary zero function must vanish if it is
evaluated at the true value of $\theta$, but not in general otherwise.


We let $f_t(\theta, y_t)$ denote an elementary zero function for observation $t$. It is
called "elementary" because it applies to a single observation. In the linear
regression case that we have been studying up to this point, $\theta$ would be
replaced by $\beta$ and we would have $f_t(\beta, y_t) = y_t - X_t\beta$. In general, we may
well have more than one elementary zero function for each observation.


We consider a model $\mathbb{M}$, which, as usual, is to be thought of as a set of DGPs.
To each DGP in $\mathbb{M}$, there corresponds a unique value of $\theta$, which is what
we often call the "true" value of $\theta$ for that DGP. It is important to note
that the uniqueness goes just one way here: A given parameter vector $\theta$ may
correspond to many DGPs, perhaps even to an infinite number of them, but
each DGP corresponds to just one parameter vector. In order to express the
key property of elementary zero functions, we must introduce a symbol for
the DGPs of the model $\mathbb{M}$. It is conventional to use the Greek letter $\mu$ for this
purpose, but then it is necessary to avoid confusion with the conventional use
of $\mu$ to denote a population mean. It is usually not difficult to distinguish the
two uses of the symbol.


The key property of elementary zero functions can now be written as 


$$E_\mu\bigl(f_t(\theta_\mu, y_t)\bigr) = 0, \tag{9.53}$$


where $E_\mu(\cdot)$ denotes the expectation under the DGP $\mu$, and $\theta_\mu$ is the (unique)
parameter vector associated with $\mu$. It is assumed that property (9.53) holds
for all $t$ and for all $\mu \in \mathbb{M}$.


If estimation based on elementary zero functions is to be possible, these func- 
tions must satisfy a number of conditions in addition to condition (9.53). Most 
importantly, we need to ensure that the model is asymptotically identified. 
We therefore assume that, for some observations, at least, 


$$E_\mu\bigl(f_t(\theta, y_t)\bigr) \ne 0 \quad\text{for all } \theta \ne \theta_\mu. \tag{9.54}$$


This just says that, if we evaluate $f_t$ at a $\theta$ that is different from the $\theta_\mu$
that corresponds to the DGP under which we take expectations, then the
expectation of $f_t(\theta, y_t)$ will be nonzero. Condition (9.54) does not have to
hold for every observation, but it must hold for a fraction of the observations
that does not tend to zero as $n \to \infty$.


In the case of the linear regression model, if we write $\beta_0$ for the true parameter
vector, condition (9.54) will be satisfied for observation $t$ if, for all $\beta \ne \beta_0$,


$$E(y_t - X_t\beta) = E\bigl(X_t(\beta_0 - \beta) + u_t\bigr) = E\bigl(X_t(\beta_0 - \beta)\bigr) \ne 0. \tag{9.55}$$


It is clear from (9.55) that condition (9.54) will be satisfied whenever the fitted
values actually depend on all the components of the vector $\beta$ for at least some
fraction of the observations. This is equivalent to the more familiar condition
that

$$S_{X^\top X} = \operatorname*{plim}_{n\to\infty} \frac{1}{n} X^\top X$$

is a positive definite matrix; see Section 6.2.


We also need to make some assumption about the variances and covariances of
the elementary zero functions. If there is just one elementary zero function per
observation, we let $f(\theta, y)$ denote the $n$-vector with typical element $f_t(\theta, y_t)$.
If there are $m > 1$ elementary zero functions per observation, then we can
group all of them into a vector $f(\theta, y)$ with $nm$ elements. In either event, we
then assume that


$$E\bigl(f(\theta, y)\, f^\top(\theta, y)\bigr) = \Omega, \tag{9.56}$$


where $\Omega$, which implicitly depends on $\mu$, is a finite, positive definite matrix.
Thus we are assuming that, under every DGP $\mu \in \mathbb{M}$, each of the $f_t$ has a
finite variance and a finite covariance with every $f_s$ for $s \ne t$.


Estimating Functions and Estimating Equations 


Like every procedure that is based on the method of moments, the method of 
estimating functions replaces relationships like (9.53) that hold in expectation 
with their empirical, or sample, counterparts. Because $\theta$ is a $k$-vector, we
will need k estimating functions in order to estimate it. In general, these are 
weighted averages of the elementary zero functions. Equating the estimating 
functions to zero yields k estimating equations, which must be solved in order 
to obtain the GMM estimator. 


As for the linear regression model, the estimating equations are, in fact, just 
sample moment conditions which, in most cases, are based on instrumental 
variables. There will generally be more instruments than parameters, and 
so we will need to form linear combinations of the instruments in order to 
construct precisely $k$ estimating equations. Let $W$ be an $n \times l$ matrix of
instruments, which are assumed to be predetermined. Usually, one column of
$W$ will be a vector of 1s. Now define $Z = WJ$, where $J$ is an $l \times k$ matrix
with full column rank $k$. Later, we will discuss how $J$, and hence $Z$, should
optimally be chosen, but, for the moment, we take $Z$ as given.


If $\theta_\mu$ is the parameter vector for the DGP $\mu$ under which we take expectations,
the theoretical moment conditions are


$$E\bigl(Z_t^\top f_t(\theta_\mu, y_t)\bigr) = 0, \tag{9.57}$$


where $Z_t$ is the $t^{\rm th}$ row of $Z$. Later on, when we take explicit account of the
covariance matrix $\Omega$ in formulating the estimating equations, we will need to
modify these conditions so that they take the form of conditions (9.31), but
(9.57) is all that is required at this stage. In fact, even (9.57) is stronger than
we really need. It is sufficient to assume that $Z_t$ and $f_t(\theta)$ are asymptotically
uncorrelated, which, together with some regularity conditions, implies that


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} Z_t^\top f_t(\theta_\mu, y_t) = 0. \tag{9.58}$$


The vector of estimating functions that corresponds to (9.57) or (9.58) is the
$k$-vector $n^{-1}Z^\top f(\theta, y)$. Equating this vector to zero yields the system of
estimating equations


$$\frac{1}{n} Z^\top f(\theta, y) = 0, \tag{9.59}$$


and solving this system yields $\hat\theta$, the nonlinear GMM estimator.
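

As a concrete illustration of solving estimating equations of the form (9.59), the following sketch treats a just-identified case with a single parameter. Everything specific in it is an assumption made for the example: the zero function $f_t(\theta) = y_t - \exp(\theta x_t)$, the single instrument $Z_t = x_t$, the simulated data, and the helper names. Bisection is used only because, with positive $x_t$, the estimating function is strictly decreasing in $\theta$.

```python
import math
import random


def estimating_function(theta, x, y):
    """n^{-1} Z'f(theta) with f_t = y_t - exp(theta * x_t) and Z_t = x_t."""
    n = len(y)
    return sum(xt * (yt - math.exp(theta * xt)) for xt, yt in zip(x, y)) / n


def solve_by_bisection(x, y, lo=-5.0, hi=5.0, tol=1e-10):
    """Solve estimating_function(theta) = 0 on [lo, hi].

    Relies on the function being decreasing in theta, which holds here
    because every x_t is positive.
    """
    if estimating_function(lo, x, y) < 0 or estimating_function(hi, x, y) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if estimating_function(mid, x, y) > 0:
            lo = mid   # function still positive: root lies to the right
        else:
            hi = mid   # function nonpositive: root lies to the left
    return 0.5 * (lo + hi)


# Simulated data from a hypothetical DGP with theta_0 = 0.5.
random.seed(0)
theta_true = 0.5
x = [random.uniform(0.1, 1.0) for _ in range(500)]
y = [math.exp(theta_true * xt) + random.gauss(0.0, 0.1) for xt in x]

theta_hat = solve_by_bisection(x, y)
print(theta_hat, estimating_function(theta_hat, x, y))
```

With several parameters the same idea applies, but a multivariate root finder or the minimization of a criterion function would replace the bisection step.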


Consistency 


If we are to prove that the nonlinear GMM estimator is consistent, we must
assume that a law of large numbers applies to the vector $n^{-1}Z^\top f(\theta, y)$. This
allows us to define the $k$-vector of limiting estimating functions,


$$\alpha(\theta; \mu) = \underset{n\to\infty}{\operatorname{plim}_\mu}\; \frac{1}{n} Z^\top f(\theta, y). \tag{9.60}$$


In words, $\alpha(\theta; \mu)$ is the probability limit, under the DGP $\mu$, of the vector of
estimating functions. Setting $\alpha(\theta; \mu)$ to 0 yields a set of limiting estimating
equations.


Either (9.57) or the weaker condition (9.58) implies that $\alpha(\theta_\mu; \mu) = 0$ for all
$\mu \in \mathbb{M}$. We then need an asymptotic identification condition strong enough
to ensure that $\alpha(\theta; \mu) \ne 0$ for all $\theta \ne \theta_\mu$. In other words, we require that the
vector $\theta_\mu$ must be the unique solution to the system of limiting estimating
equations. If we assume that such a condition holds, it is straightforward to
prove consistency in the nonrigorous way we used in Sections 6.2 and 8.3.
Evaluating equations (9.59) at their solution $\hat\theta$, we find that


$$\frac{1}{n} Z^\top f(\hat\theta, y) = 0. \tag{9.61}$$


As $n \to \infty$, the left-hand side of this system of equations tends under $\mu$
to the vector $\alpha(\operatorname{plim}_\mu \hat\theta; \mu)$, and the right-hand side remains a zero vector.
Given the asymptotic identification condition, the equality in (9.61) can hold
asymptotically only if

$$\underset{n\to\infty}{\operatorname{plim}_\mu}\; \hat\theta = \theta_\mu.$$

Therefore, we conclude that the nonlinear GMM estimator $\hat\theta$, which solves the
system of estimating equations (9.59), consistently estimates the parameter
vector $\theta_\mu$, for all $\mu \in \mathbb{M}$, provided the asymptotic identification condition is
satisfied.


Asymptotic Normality 


For ease of notation, we now fix the DGP $\mu \in \mathbb{M}$ and write $\theta_\mu = \theta_0$. Thus
$\theta_0$ has its usual interpretation as the "true" parameter vector. In addition,
we suppress the explicit mention of the data vector $y$. As usual, the proof
that $n^{1/2}(\hat\theta - \theta_0)$ is asymptotically normally distributed is based on a Taylor
series approximation, a law of large numbers, and a central limit theorem. For
the purposes of the first of these, we need to assume that the zero functions
$f_t$ are continuously differentiable in the neighborhood of $\theta_0$. If we perform
a first-order Taylor expansion of $n^{1/2}$ times (9.59) around $\theta_0$ and introduce
some appropriate factors of powers of $n$, we obtain the result that


$$n^{-1/2} Z^\top f(\theta_0) + n^{-1} Z^\top F(\bar\theta)\, n^{1/2}(\hat\theta - \theta_0) = 0, \tag{9.62}$$


where the $n \times k$ matrix $F(\theta)$ has typical element

$$F_{ti}(\theta) = \frac{\partial f_t(\theta)}{\partial \theta_i}, \tag{9.63}$$


where $\theta_i$ is the $i^{\rm th}$ element of $\theta$. This matrix, like $f(\theta)$ itself, depends
implicitly on the vector $y$ and is therefore stochastic. The notation $F(\bar\theta)$ in (9.62)
is the convenient shorthand we introduced in Section 6.2: Row $t$ of the matrix
is the corresponding row of $F(\theta)$ evaluated at $\theta = \bar\theta_t$, where the $\bar\theta_t$ all satisfy
the inequality


$$\|\bar\theta_t - \theta_0\| \le \|\hat\theta - \theta_0\|.$$


The consistency of $\hat\theta$ then implies that the $\bar\theta_t$ also tend to $\theta_0$ as $n \to \infty$.


The consistency of the $\bar\theta_t$ implies that


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\bar\theta) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\theta_0). \tag{9.64}$$


Under reasonable regularity conditions, we can apply a law of large numbers
to the right-hand side of (9.64), and the probability limit is then deterministic.
For asymptotic normality, we also require that it should be nonsingular.
This is a condition of strong asymptotic identification, of the sort used in
Section 6.2. By a first-order Taylor expansion of $\alpha(\theta; \mu)$ around $\theta_0$, where it
is equal to 0, we see from the definition (9.60) that


$$\alpha(\theta; \mu) \cong \operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\theta_0)(\theta - \theta_0). \tag{9.65}$$


Therefore, the condition that the right-hand side of (9.64) is nonsingular is a
strengthening of the condition that $\theta$ is asymptotically identified. Because it
is nonsingular, the system of equations


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\theta_0)(\theta - \theta_0) = 0$$


has no solution other than $\theta = \theta_0$. By (9.65), this implies that $\alpha(\theta; \mu) \ne 0$
for all $\theta \ne \theta_0$, which is the asymptotic identification condition.
Applying the results just discussed to equation (9.62), we find that


$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{=} -\Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\theta_0)\Bigr)^{-1} n^{-1/2} Z^\top f(\theta_0). \tag{9.66}$$


Next, we apply a central limit theorem to the second factor on the right-hand
side of (9.66). Doing so demonstrates that $n^{1/2}(\hat\theta - \theta_0)$ is asymptotically
normally distributed. By (9.57), the vector $n^{-1/2}Z^\top f(\theta_0)$ must have mean 0,
and, by (9.56), its covariance matrix is $\operatorname{plim} n^{-1}Z^\top \Omega Z$. In stating this
result, we assume that (9.02) holds with the $f(\theta_0)$ in place of the error terms.
Then (9.66) implies that the vector $n^{1/2}(\hat\theta - \theta_0)$ is asymptotically normally
distributed with mean vector 0 and covariance matrix


$$\Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top F(\theta_0)\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top \Omega Z\Bigr) \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} F^\top(\theta_0) Z\Bigr)^{-1}. \tag{9.67}$$


Since this is a sandwich covariance matrix, it is evident that the nonlinear
GMM estimator $\hat\theta$ is not, in general, an asymptotically efficient estimator.


Asymptotically Efficient Estimation 


In order to obtain an asymptotically efficient nonlinear GMM estimator, we
need to choose the estimating functions $n^{-1}Z^\top f(\theta)$ optimally. This is
equivalent to choosing $Z$ optimally. How we should do this will depend on what
assumptions we make about $F(\theta)$ and $\Omega$, the covariance matrix of $f(\theta)$. Not
surprisingly, we will obtain results very similar to the results for linear GMM
estimation obtained in Section 9.2.


We begin with the simplest possible case, in which $\Omega = \sigma^2 I$, and $F(\theta_0)$ is
predetermined in the sense that


$$E\bigl(F_t(\theta_0) f_t(\theta_0)\bigr) = 0, \tag{9.68}$$


where $F_t(\theta_0)$ is the $t^{\rm th}$ row of $F(\theta_0)$. If we ignore the probability limits
and the factors of $n^{-1}$, the sandwich covariance matrix (9.67) is in this case
proportional to

$$(Z^\top F_0)^{-1} Z^\top Z\, (F_0^\top Z)^{-1}, \tag{9.69}$$


where, for ease of notation, $F_0 = F(\theta_0)$. The inverse of (9.69), which is
proportional to the asymptotic precision matrix of the estimator, is


$$F_0^\top Z (Z^\top Z)^{-1} Z^\top F_0 = F_0^\top P_Z F_0. \tag{9.70}$$


If we set $Z = F_0$, (9.69) is no longer a sandwich, and (9.70) simplifies to
$F_0^\top F_0$. The difference between $F_0^\top F_0$ and the general expression (9.70) is


$$F_0^\top F_0 - F_0^\top P_Z F_0 = F_0^\top M_Z F_0,$$


which is a positive semidefinite matrix because $M_Z = I - P_Z$ is an orthogonal
projection matrix. Thus, in this simple case, the optimal instrument matrix
is just $F_0$.
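

The efficiency comparison above can be checked numerically in the scalar case $k = 1$, where (9.69) reduces to $\sum z_t^2 / (\sum z_t F_{0t})^2$. The sketch below is only an illustration under assumed inputs: the column `f0` and the alternative instrument `z_other` are simulated, and the function name `sandwich` is hypothetical. By the Cauchy-Schwarz inequality, the value obtained with $Z = F_0$ cannot exceed the value obtained with any other instrument.

```python
import random


def sandwich(z, f0):
    """Scalar version of (9.69): (Z'F0)^{-1} Z'Z (F0'Z)^{-1} with k = 1."""
    zf = sum(zt * ft for zt, ft in zip(z, f0))
    zz = sum(zt * zt for zt in z)
    return zz / (zf * zf)


random.seed(1)
n = 200
f0 = [random.uniform(0.5, 2.0) for _ in range(n)]      # hypothetical column F0
z_other = [ft + random.gauss(0.0, 1.0) for ft in f0]   # a noisier instrument

v_opt = sandwich(f0, f0)          # collapses to 1 / (F0'F0)
v_other = sandwich(z_other, f0)   # at least as large as v_opt
print(v_opt, v_other)
```

The gap between the two values is the scalar counterpart of the positive semidefinite difference $F_0^\top M_Z F_0$, expressed in terms of variances rather than precisions.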


Since we do not know $\theta_0$, it is not feasible to use $F_0$ directly as the matrix of
instruments. Instead, we use the trick that leads to the moment conditions
(6.27) which define the NLS estimator. This leads us to solve the estimating
equations

$$\frac{1}{n} F^\top(\theta) f(\theta) = 0. \tag{9.71}$$


If $\Omega = \sigma^2 I$, and $F(\theta)$ is predetermined, solving these equations yields an
asymptotically efficient GMM estimator.
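

To see how (9.71) connects with nonlinear least squares, consider a nonlinear regression model written in the notation of this section. This is only an illustrative sketch: the regression function $x_t(\theta)$ and its derivative matrix $X(\theta)$ are symbols assumed here for the example.

$$f_t(\theta) = y_t - x_t(\theta), \qquad F_{ti}(\theta) = \frac{\partial f_t(\theta)}{\partial\theta_i} = -\frac{\partial x_t(\theta)}{\partial\theta_i} = -X_{ti}(\theta),$$

so that $F(\theta) = -X(\theta)$ and the estimating equations (9.71) become

$$-\frac{1}{n} X^\top(\theta)\bigl(y - x(\theta)\bigr) = 0.$$

Apart from the irrelevant factor of $-1/n$, these are the first-order conditions for minimizing the sum of squared residuals $\bigl(y - x(\theta)\bigr)^\top\bigl(y - x(\theta)\bigr)$ with respect to $\theta$.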


It is not valid to use the columns of $F(\theta)$ as instruments if condition (9.68)
is not satisfied. In that event, the analysis of Section 8.3, taken up again in
Section 9.2, suggests that we should replace the rows of $F_0$ by their expectations
conditional on the information sets $\Omega_t$ generated by variables that are
exogenous or predetermined for observation $t$. Let us define an $n \times k$ matrix
$\bar F$, in terms of its typical row $\bar F_t$, and another $n \times k$ matrix $V$, as follows:


$$\bar F_t = E\bigl(F_t(\theta_0) \mid \Omega_t\bigr) \quad\text{and}\quad V = F_0 - \bar F. \tag{9.72}$$


The matrices $\bar F$ and $V$ are entirely analogous to the matrices $\bar X$ and $V$ used
in Section 8.3. The definitions (9.72) imply that


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top F_0 = \operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top(\bar F + V) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top \bar F. \tag{9.73}$$


The term $\operatorname{plim} n^{-1}\bar F^\top V$ equals 0 because (9.72) implies that $E(V_t \mid \Omega_t) = 0$,
and the conditional expectation $\bar F_t$ belongs to the information set $\Omega_t$.
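To find the asymptotic covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ when $\bar F$ is used in
place of $Z$ and the covariance matrix of $f(\theta)$ is $\sigma^2 I$, we start from expression
(9.67). Using (9.73), we obtain


$$\sigma^2 \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top F_0\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top \bar F\Bigr) \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} F_0^\top \bar F\Bigr)^{-1} = \sigma^2 \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top \bar F\Bigr)^{-1}. \tag{9.74}$$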


For any other choice of instrument matrix $Z$, the argument giving (9.73) shows
that $\operatorname{plim} n^{-1}Z^\top F_0 = \operatorname{plim} n^{-1}Z^\top \bar F$, and so the covariance matrix (9.67)
becomes

$$\sigma^2 \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top \bar F\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top Z\Bigr) \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} \bar F^\top Z\Bigr)^{-1}. \tag{9.75}$$


The inverse of (9.75) is $1/\sigma^2$ times the probability limit of


$$\frac{1}{n} \bar F^\top Z (Z^\top Z)^{-1} Z^\top \bar F = \frac{1}{n} \bar F^\top P_Z \bar F. \tag{9.76}$$


This expression is analogous to expression (8.21) for the asymptotic precision
of the IV estimator for linear regression models with endogenous explanatory
variables. Since the difference between $n^{-1}\bar F^\top \bar F$ and (9.76) is the positive
semidefinite matrix $n^{-1}\bar F^\top M_Z \bar F$, we conclude that (9.74) is indeed the
asymptotic covariance matrix that corresponds to the optimal choice of $Z$.
Therefore, when $F_t(\theta)$ is not predetermined, we should use its expectation
conditional on $\Omega_t$ in the matrix of instruments.


In practice, of course, the matrix $\bar F$ will rarely be observed. We therefore
need to estimate it. The natural way to do so is to regress $F(\theta)$ on an $n \times l$
matrix of instruments $W$, where $l \ge k$, with the inequality holding strictly in
most cases. This yields fitted values $P_W F(\theta)$. If we estimate $\bar F$ in this way,
the optimal estimating equations become


$$\frac{1}{n} F^\top(\theta) P_W f(\theta) = 0. \tag{9.77}$$


By reasoning like that which led to (8.27) and (9.73), it can be seen that these
estimating equations are asymptotically equivalent to the same equations with
$\bar F$ in place of $F(\theta)$. In particular, if $\mathcal{S}(\bar F) \subseteq \mathcal{S}(W)$, the estimator obtained
by solving (9.77) is asymptotically equivalent to the one obtained using the
optimal instruments $\bar F$.


The estimating equations (9.77) generalize the first-order conditions (8.28) for
linear IV estimation and the moment conditions (8.84) for nonlinear IV
estimation. As readers are asked to show in Exercise 9.14, the solution to (9.77)
in the case of the linear regression model is simply the generalized IV
estimator (8.29). As can be seen from (9.67), the asymptotic covariance matrix of
the estimator $\hat\theta$ defined by (9.77) can be estimated by


$$\hat\sigma^2 (\hat F^\top P_W \hat F)^{-1},$$


where $\hat F = F(\hat\theta)$, and $\hat\sigma^2 = n^{-1} \sum_{t=1}^{n} f_t^2(\hat\theta)$, the average of the squares of the
elementary zero functions evaluated at $\hat\theta$, is a natural estimator of $\sigma^2$.


Efficient Estimation with an Unknown Covariance Matrix 


When the covariance matrix $\Omega$ is unknown, the GMM estimators defined by
the estimating equations (9.71) or (9.77), according to whether or not $F(\theta)$ is
predetermined, are no longer asymptotically efficient in general. But, just as
we did in Section 9.3 with regression models, we can obtain estimates that are
efficient for a given set of instruments by using a heteroskedasticity-consistent
or a HAC estimator.


Suppose there are $l > k$ instruments which form an $n \times l$ matrix $W$. As in
Section 9.2, we can construct estimating equations with instruments $Z = WJ$,
using a full-rank $l \times k$ matrix $J$ to select $k$ linear combinations of the full set
of instruments. The asymptotic covariance matrix of the estimator obtained
by solving these equations is then, by (9.67),


$$\Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} J^\top W^\top F_0\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} J^\top W^\top \Omega W J\Bigr) \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} F_0^\top W J\Bigr)^{-1}. \tag{9.78}$$


This looks just like (9.07) with $F_0$ in place of the regressor matrix $X$. The
optimal choice of $J$ is therefore just (9.08) with $F_0$ in place of $X$. Since (9.08)
depends on the unknown true $\Omega$, we replace $n^{-1}W^\top \Omega W$ by an estimator $\hat\Sigma$
which could be either a heteroskedasticity-consistent or a HAC estimator.
This yields the estimating equations


$$F^\top(\theta)\, W \hat\Sigma^{-1} W^\top f(\theta) = 0, \tag{9.79}$$


and the asymptotic covariance matrix (9.78) simplifies to


$$\Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} F_0^\top W\, \hat\Sigma^{-1}\, \frac{1}{n} W^\top F_0\Bigr)^{-1}, \tag{9.80}$$


in which, if $F(\theta)$ is not predetermined, we may write $\bar F$ instead of $F_0$ without
changing the limit. In practice, we can use


$$\widehat{\operatorname{Var}}(\hat\theta) = n (\hat F^\top W \hat\Sigma^{-1} W^\top \hat F)^{-1}, \tag{9.81}$$


where $\hat F = F(\hat\theta)$, to estimate the covariance matrix of $\hat\theta$. As with the
estimator (9.41) for the linear regression case, the factor of $n$ is needed to offset the
factor of $n^{-1}$ in $\hat\Sigma$. The matrix (9.81) can be used to construct Wald tests
and asymptotic confidence intervals in the usual way.


Efficient Estimation with a Known Covariance Matrix 


When the covariance matrix $\Omega$ is known, we can obtain a fully efficient GMM
estimator. As before, we will let $\Psi$ denote an $n \times n$ matrix which satisfies the
equation $\Omega^{-1} = \Psi\Psi^\top$. The variance of the vector $\Psi^\top f(\theta_0)$, where $\theta_0$ is the
true parameter vector for the DGP that generates the data, is then


$$E\bigl(\Psi^\top f(\theta_0) f^\top(\theta_0) \Psi\bigr) = \Psi^\top \Omega \Psi = I.$$


Thus the components of the vector $\Psi^\top f(\theta_0)$ form a set of zero functions that
are homoskedastic and serially uncorrelated. As we mentioned in Section 9.2,
it is often possible to choose $\Psi$ in such a way that these components can be
thought of as innovations in the sense of Section 4.5, and in this case $\Psi$ will
usually be upper triangular.


The matrix $\Psi$ does not depend on the parameters $\theta$. Therefore, the matrix
of derivatives of the transformed zero functions in the vector $\Psi^\top f(\theta)$ is just
$\Psi^\top F(\theta)$. Consequently, if the $t^{\rm th}$ row of $\Psi^\top F(\theta)$ is predetermined with
respect to the $t^{\rm th}$ component of $\Psi^\top f(\theta)$, the optimal estimating equations are
constructed using the columns of $\Psi^\top F(\theta_0)$ as instruments. Because $\theta_0$ is not
known, the optimal instruments are estimated along with the parameters by
using the estimating equations


$$\frac{1}{n} F^\top(\theta) \Psi\Psi^\top f(\theta) = \frac{1}{n} F^\top(\theta)\, \Omega^{-1} f(\theta) = 0, \tag{9.82}$$


as in (9.71). The asymptotic covariance matrix of the resulting estimator is


$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\theta - \theta_0)\Bigr) = \operatorname*{plim}_{n\to\infty} \Bigl(\frac{1}{n} F_0^\top \Omega^{-1} F_0\Bigr)^{-1}, \tag{9.83}$$


where, as usual, $F_0 = F(\theta_0)$. The derivation of (9.83) from (9.67) is quite
straightforward; see Exercise 9.15. In practice, the covariance matrix of $\hat\theta$ will
normally be estimated by


$$\widehat{\operatorname{Var}}(\hat\theta) = (\hat F^\top \Omega^{-1} \hat F)^{-1}. \tag{9.84}$$


If the $t^{\rm th}$ row of $\Psi^\top F(\theta)$ is not predetermined with respect to the $t^{\rm th}$
component of $\Psi^\top f(\theta)$, and if this component is an innovation, then we can determine
the optimal instruments just as we did in Section 9.2. By analogy with (9.24),
we define the matrix $\bar F(\theta)$ implicitly by the equation


$$E\bigl((\Psi^\top F(\theta))_t \mid \Omega_t\bigr) = (\Psi^\top \bar F(\theta))_t. \tag{9.85}$$


As in Section 9.2, making this definition explicit depends on the details of
the particular model under study. The moment conditions for fully efficient
estimation are then given by (9.82) with $F(\theta)$ replaced by $\bar F(\theta)$. The
asymptotic covariance matrix is (9.83) with $F_0$ replaced by $\bar F_0$, and the covariance
matrix of $\hat\theta$ can be estimated by (9.84) with $\hat F$ replaced by $\bar F(\hat\theta)$. All of these
claims are proved in the same way as were the corresponding ones for linear
regressions in Section 9.2.


When the matrix $\bar F(\theta)$ is not observable, as will frequently be the case, we can
often find an $n \times l$ matrix of instruments $W$, where usually $l > k$, such that
$W$ satisfies the predeterminedness condition in its form (9.30), and such that
$\mathcal{S}(\bar F(\theta_0)) \subseteq \mathcal{S}(W)$. In such cases, overidentified estimation that makes use
of the transformed zero functions $\Psi^\top f(\theta)$ and the transformed instruments
$\Psi^\top W$ yields asymptotically efficient estimates. The results of Exercises 9.8
and 9.9 can also be readily extended to the present nonlinear case.


Minimizing Criterion Functions 

The nonlinear GMM estimators we have discussed in this section can all, like
the ones for linear regression models, be obtained by minimizing appropriately
chosen quadratic forms. We restrict our attention to cases in which
$\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) \ne 0$, and we employ an $n \times l$ matrix of instruments, $W$.
When the covariance matrix $\Omega$ of the elementary zero functions is unknown,
but a heteroskedasticity-consistent or HAC estimator $\hat\Sigma$ is available, the
appropriate GMM criterion function is


$$\frac{1}{n} f^\top(\theta)\, W \hat\Sigma^{-1} W^\top f(\theta). \tag{9.86}$$


Minimizing this function with respect to $\theta$ is equivalent to solving the
estimating equations (9.79).


In the case in which the matrix $\Omega$ is known, or can be estimated consistently,
the fully efficient estimators of the previous subsection can be obtained by
minimizing the quadratic form


$$f^\top(\theta)\, \Psi P_{\Psi^\top W} \Psi^\top f(\theta), \tag{9.87}$$


where $\Psi\Psi^\top = \Omega^{-1}$, the components of $\Psi^\top f(\theta_0)$ are innovations, and the
matrix $W$ satisfies the predeterminedness condition in the form (9.30). For
full efficiency, the span $\mathcal{S}(W)$ of the instruments must (asymptotically) include
as a subspace the span of the $\bar F(\theta_0)$, as defined in (9.85). In Exercise 9.16,
readers are asked to check that minimizing (9.87) is asymptotically equivalent
to solving the optimal estimating equations.


Fortunately, we need not treat (9.86) and (9.87) separately. As in Section 9.4,
expression (9.86) is asymptotically unchanged if we replace $\hat\Sigma$ by $n^{-1}W^\top \Omega W$,
where $\Omega$ is the true covariance matrix of the zero functions. Making this
replacement, we see that both (9.86) and (9.87) can be written as


$$Q(\theta, y) = f^\top(\theta)\, \Psi P_A \Psi^\top f(\theta), \tag{9.88}$$


where $A = \Psi^{-1}W$ and $A = \Psi^\top W$ for the criterion functions (9.86) and
(9.87), respectively. Note how closely (9.88) resembles expression (9.45) for
the linear regression case.
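

The claim for the first of these two cases can be checked directly. The following short verification assumes only that $\Psi$ is nonsingular with $\Psi\Psi^\top = \Omega^{-1}$, and it writes the projection explicitly as $P_A = A(A^\top A)^{-1}A^\top$. With $A = \Psi^{-1}W$,

$$A^\top A = W^\top (\Psi^{-1})^\top \Psi^{-1} W = W^\top (\Psi\Psi^\top)^{-1} W = W^\top \Omega W,$$

$$\Psi P_A \Psi^\top = \Psi\Psi^{-1} W (W^\top \Omega W)^{-1} W^\top (\Psi^{-1})^\top \Psi^\top = W (W^\top \Omega W)^{-1} W^\top.$$

Hence

$$Q(\theta, y) = f^\top(\theta)\, W (W^\top \Omega W)^{-1} W^\top f(\theta) = \frac{1}{n} f^\top(\theta)\, W \Bigl(\frac{1}{n} W^\top \Omega W\Bigr)^{-1} W^\top f(\theta),$$

which is (9.86) with the weighting matrix estimate replaced by $n^{-1}W^\top \Omega W$. In the second case, setting $A = \Psi^\top W$ reproduces (9.87) by definition.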


It is often more convenient to compute GMM estimators by minimizing a 
criterion function than by directly solving a set of estimating equations. One 
advantage is that algorithms for minimizing functions tend to be more stable 
numerically than algorithms for solving sets of nonlinear equations. Another 
advantage is that the criterion function may have more than one stationary 
point. In this event, the estimating equations are satisfied at each of these 
stationary points, although the criterion function may have a unique global 
minimum, which then corresponds to the solution of interest. 


However, the main advantage of working with criterion functions is that the
minimized value of a GMM criterion function can be used for testing, as we
have already discussed for the linear regression case in Section 9.4. Notice that
the factor of $n^{-1}$ in (9.86), which does not matter for estimation, is essential
when the criterion function is being used for testing. Its role is to offset the
factor of $n^{-1}$ in the definition of $\hat\Sigma$.


Tests Based on the GMM Criterion Function 


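The Hansen-Sargan overidentification test statistic is $Q(\hat\theta, y)$, the minimized
value of the GMM criterion function. Up to an irrelevant scalar factor, the
first-order conditions for the minimization of (9.88) are


$$F^\top(\hat\theta)\, \Psi P_A \Psi^\top f(\hat\theta) = 0, \tag{9.89}$$


and it follows from this, either by a Taylor expansion or directly by using the
result (9.66), that


$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{=} -\Bigl(\frac{1}{n} F_0^\top \Psi P_A \Psi^\top F_0\Bigr)^{-1} n^{-1/2} F_0^\top \Psi P_A \Psi^\top f_0,$$


where, as usual, $F_0$ and $f_0$ denote $F(\theta_0)$ and $f(\theta_0)$, respectively. We now
follow quite closely the calculations of Section 9.4 in order to show that the
minimized quadratic form $Q(\hat\theta, y)$ is asymptotically distributed as $\chi^2(l - k)$.
By a short Taylor expansion, we see that


$$\begin{aligned}
P_A \Psi^\top f(\hat\theta) &\stackrel{a}{=} P_A \Psi^\top f_0 + n^{-1/2} P_A \Psi^\top F_0\, n^{1/2}(\hat\theta - \theta_0) \\
&\stackrel{a}{=} P_A \Psi^\top f_0 - n^{-1/2} P_A \Psi^\top F_0 \Bigl(\frac{1}{n} F_0^\top \Psi P_A \Psi^\top F_0\Bigr)^{-1} n^{-1/2} F_0^\top \Psi P_A \Psi^\top f_0 \\
&= (I - P_{P_A \Psi^\top F_0})\, P_A \Psi^\top f_0,
\end{aligned}$$


where $P_{P_A \Psi^\top F_0}$ projects orthogonally on to $\mathcal{S}(P_A \Psi^\top F_0)$. Thus $Q(\hat\theta, y)$, the
minimized value of the criterion function (9.88), is


$$\begin{aligned}
f^\top(\hat\theta)\, \Psi P_A \Psi^\top f(\hat\theta) &\stackrel{a}{=} f_0^\top \Psi P_A (I - P_{P_A \Psi^\top F_0}) P_A \Psi^\top f_0 \\
&= f_0^\top \Psi (P_A - P_{P_A \Psi^\top F_0}) \Psi^\top f_0.
\end{aligned} \tag{9.90}$$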


Because $\mathcal{S}(P_A \Psi^\top F_0) \subseteq \mathcal{S}(A)$, the difference of projection matrices in the
last expression above is itself an orthogonal projection matrix, of which the
image is of dimension $l - k$. As with (9.48), we see that estimating $\theta$ uses
up $k$ degrees of freedom. By essentially the same argument as was used for
(9.48), it can be shown that (9.90) is asymptotically distributed as $\chi^2(l - k)$.
Thus, as expected, $Q(\hat\theta, y)$ is the Hansen-Sargan test statistic for nonlinear
GMM estimation.


As in the case of linear regression models, the difference between the GMM 
criterion function (9.88) evaluated at restricted estimates and evaluated at 
unrestricted estimates is asymptotically distributed as $\chi^2(r)$ when there are $r$
equality restrictions. We will not prove this result, which was proved for the 
linear case in Section 9.3. However, we will present a very simple argument 
which provides an intuitive explanation. 


Let $\tilde\theta$ and $\hat\theta$ denote, respectively, the vectors of restricted and unrestricted
(feasible) efficient GMM estimates. From the result for the Hansen-Sargan test
that was just proved, we know that $Q(\tilde\theta, y)$ and $Q(\hat\theta, y)$ are asymptotically
distributed as $\chi^2(l - k + r)$ and $\chi^2(l - k)$, respectively. Therefore, since a
random variable that follows the $\chi^2(m)$ distribution is equal to the sum of $m$
independent $\chi^2(1)$ variables,


$$Q(\tilde\theta, y) \stackrel{a}{=} \sum_{i=1}^{l-k+r} x_i^2 \quad\text{and}\quad Q(\hat\theta, y) \stackrel{a}{=} \sum_{i=1}^{l-k} y_i^2, \tag{9.91}$$


where the $x_i$ and $y_i$ are independent, standard normal random variables. Now
suppose that the first $l - k$ of the $x_i$ are equal to the corresponding $y_i$. If so,
(9.91) implies that


$$Q(\tilde\theta, y) - Q(\hat\theta, y) \stackrel{a}{=} \sum_{i=1}^{l-k+r} x_i^2 - \sum_{i=1}^{l-k} y_i^2 = \sum_{i=l-k+1}^{l-k+r} x_i^2. \tag{9.92}$$


Since the leftmost expression here is the test statistic we are interested in and
the rightmost expression is evidently distributed as $\chi^2(r)$, we have apparently
proved the result. The proof is not complete, of course, because we have not
shown that the first $l - k$ of the $x_i$ are, in fact, equal to the corresponding $y_i$.
To prove this, we would need to show that, asymptotically, $Q(\tilde\theta, y)$ is equal
to $Q(\hat\theta, y)$ plus another random variable independent of $Q(\hat\theta, y)$. This other
random variable would then be equal to the rightmost expression in (9.92).
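

The heuristic argument can be mimicked with a small simulation. The sketch below simply imposes the unproved assumption that the first $l - k$ of the $x_i$ coincide with the $y_i$, and then checks that the difference of the two sums has mean close to $r$, as a $\chi^2(r)$ variable should. The particular values of `l_minus_k`, `r`, and `reps` are arbitrary choices made for the illustration.

```python
import random

random.seed(2)
l_minus_k, r, reps = 3, 2, 20000

diffs = []
for _ in range(reps):
    # l - k + r independent standard normal draws
    x = [random.gauss(0.0, 1.0) for _ in range(l_minus_k + r)]
    y = x[:l_minus_k]                       # imposed: y_i = x_i for i <= l - k
    q_restricted = sum(v * v for v in x)    # behaves like chi2(l - k + r)
    q_unrestricted = sum(v * v for v in y)  # behaves like chi2(l - k)
    diffs.append(q_restricted - q_unrestricted)

mean_diff = sum(diffs) / reps
print(mean_diff)   # should be close to r, the mean of a chi-squared(r) variable
```

A simulation of this kind illustrates the algebra in (9.92) but does nothing to establish the independence property that a complete proof would require.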


Nonlinear GMM Estimators: Overview 


We have discussed a large number of nonlinear GMM estimators, and it can 
be confusing to keep track of them all. We therefore conclude this section 
with a brief summary of the principal cases that are likely to be encountered 
in applied econometric work. 


Case 1. Scalar covariance matrix: $\Omega = \sigma^2 I$.


When $\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) = 0$, we solve the estimating equations (9.71) to
obtain an efficient estimator. This is equivalent to minimizing $f^\top(\theta)f(\theta)$.
The estimated covariance matrix of $\hat\theta$ will be


$$\widehat{\operatorname{Var}}(\hat\theta) = \hat\sigma^2 (\hat F^\top \hat F)^{-1},$$


where $\hat\sigma^2$ consistently estimates $\sigma^2$. If the model is a nonlinear regression
model, then $\hat\theta$ is really the nonlinear least squares estimator discussed in
Section 6.3.


When $\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) \ne 0$, we must replace $F(\theta)$ by an estimate of
its conditional expectation. This means that we solve the estimating
equations (9.77), which is equivalent to minimizing $f^\top(\theta)P_W f(\theta)$. The estimated
covariance matrix of $\hat\theta$ will be


$$\widehat{\operatorname{Var}}(\hat\theta) = \hat\sigma^2 (\hat F^\top P_W \hat F)^{-1}.$$


If the model is a nonlinear regression model, then $\hat\theta$ is really the nonlinear
instrumental variables estimator discussed in Section 8.9.


Case 2. Covariance matrix known up to a scalar factor: $\Omega = \sigma^2 \Delta$.


When $\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) = 0$, we solve the estimating equations (9.82), with
$\Omega$ replaced by $\Delta$, to obtain an efficient estimator. This is equivalent to
minimizing $f^\top(\theta)\Delta^{-1}f(\theta)$. The estimated covariance matrix will be


$$\widehat{\operatorname{Var}}(\hat\theta) = \hat\sigma^2 (\hat F^\top \Delta^{-1} \hat F)^{-1},$$


where $\hat\sigma^2$ consistently estimates $\sigma^2$. If the underlying model is a nonlinear
regression model, then $\hat\theta$ is really the nonlinear GLS estimator discussed in
Section 7.3.


When $\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) \ne 0$, we again must replace $F(\theta)$ by an estimate of
its conditional expectation. This means that we should solve the estimating
equations (9.89) with $A = \Psi^\top W$, where $\Psi$ satisfies $\Delta^{-1} = \Psi\Psi^\top$. This is
equivalent to minimizing (9.88) with the same definition of $A$. The estimated
covariance matrix will be


$$\hat\sigma^2 (\hat F^\top \Psi P_{\Psi^\top W} \Psi^\top \hat F)^{-1}.$$


If the model is a linear regression model, then $\hat\theta$ is the fully efficient GMM
estimator (9.26) whenever the span of the instruments $W$ includes the span
of the optimal instruments $\bar X$.


When the matrix $\Delta$ is unknown but depends on a fixed number of parameters
that can be estimated consistently, we can replace $\Delta$ by a consistent estimator
$\hat\Delta$ and proceed as if it were known, as in feasible GLS estimation.


Case 3. Unknown diagonal or general covariance matrix. 


This is the most commonly encountered case in which GMM estimation is
explicitly used. Fully efficient estimation is no longer possible, but we can
still obtain estimates that are efficient for a given set of instruments by using
a consistent estimator $\hat\Sigma$ of the matrix $\Sigma$ defined in (9.33). This will be
heteroskedasticity-consistent if $\Omega$ is assumed to be diagonal and some sort of
HAC estimator otherwise. Whether or not $\operatorname{plim} n^{-1}F^\top(\theta)f(\theta) = 0$, we solve
the estimating equations (9.79), which is equivalent to minimizing (9.86).
The estimated covariance matrix will be (9.81). If there is to be any gain in
efficiency relative to NLS or nonlinear IV, it is essential that $l$, the number of
columns of $W$, be greater than $k$, the number of parameters to be estimated.


The consistent estimator $\hat\Sigma$ is usually obtained from initial estimates that
are consistent but inefficient. These may be NLS estimates, nonlinear IV
estimates, or GMM estimates that do not use the optimal weighting matrix.
The efficient GMM estimates are usually obtained by minimizing the criterion
function (9.86), and the minimized value of this criterion function then serves
as a Hansen-Sargan test statistic.


The first-round estimates $\hat\theta$ can be used to obtain a new estimate of $\Sigma$, which
can then be used to obtain a second-round estimate of $\theta$, which can be used
to obtain yet another estimate of $\Sigma$, and so on, until the process converges
or the investigator loses patience. For a correctly specified model, all of these 
estimators will have the same asymptotic distribution. However, performing 
more than one iteration will often improve the finite-sample properties of the 
estimator. Thus, if computing cost is not a problem, it may well be best to 
use the continuously updated estimator that has been iterated to convergence. 


For a more thorough treatment of the asymptotic theory of GMM estimation,
see Newey and McFadden (1994).


9.6 The Method of Simulated Moments


It is often possible to use GMM even when the elementary zero functions 
cannot be evaluated analytically. Suppose they take the form 


$$f_t(y_t, \theta) = h_t(y_t) - m_t(\theta), \qquad t = 1, \ldots, n, \tag{9.93}$$


where the function $h_t(y_t)$ depends only on $y_t$ and, possibly, on exogenous or
predetermined variables. The function $m_t(\theta)$ depends only on exogenous or
predetermined variables and on the parameters. Like a regression function,
it is the expectation of $h_t(y_t)$, conditional on the information set $\Omega_t$, under a
DGP characterized by the parameter vector $\theta$. Estimating such a model by
GMM presents no special difficulty if the form of $m_t(\theta)$ is known analytically,
but this need not be the case.


There are numerous situations in which $m_t(\theta)$ may not be known analytically.
In particular, it may well occur in models which involve latent variables, that 
is, variables which are not observable by an econometrician. The variables 
that actually are observed are related to the latent variables in such a way 
that knowing the former does not permit the values of the latter to be fully 
recovered. One example, which was discussed in Section 8.2, is economic 
variables that are observed with measurement error. Another example is 
variables that are censored, in the sense that they are observed only to a 
limited extent, for instance when only the sign of the variable is observed, or 
when all negative values are replaced by zeros. Even if the distributions of 
the latent variables are tractable, those of the observed variables may not be. 
In particular, it may not be possible to obtain analytic expressions for their 
expectations, or for the expectations of functions of them. 


Even when analytic expressions are not available, it is often possible to obtain 
simulation-based estimates of the distributions of the observed variables. For 
example, suppose that an observed variable is equal to a latent variable plus 
a measurement error of some known distribution, possibly dependent on the 
parameter vector $\theta$. Suppose further that, for a DGP characterized by $\theta$,
we can readily generate simulated values of the latent variable. Simulated 
values of the observed variable can then be generated by adding simulated 
measurement errors, drawn from their known distribution, to the simulated 
values of the latent variable. The mean of these drawings then provides an 
estimate of the expectation of the observed variable. 
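The simulation just described can be sketched in a few lines. This is only an illustration under assumed distributions: the latent variable is taken to be $N(\theta, 1)$ and the measurement error $N(0, 0.25)$, neither of which comes from the text, and the function name `simulated_expectation` is hypothetical.

```python
import random
import statistics

def simulated_expectation(theta, n_draws=20000, seed=0):
    """Monte Carlo estimate of E(observed), where observed = latent + error."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        latent = rng.gauss(theta, 1.0)  # assumed latent DGP: N(theta, 1)
        error = rng.gauss(0.0, 0.5)     # assumed measurement error: N(0, 0.25)
        draws.append(latent + error)
    # The mean of the simulated observed values estimates their expectation.
    return statistics.fmean(draws)

estimate = simulated_expectation(theta=2.0)
```

Under these assumptions the true expectation is simply $\theta$, so the sketch can be checked against a known answer; in realistic applications no such check is available, which is why simulation is needed.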


In general, an unbiased simulator for the unknown expectation $m_t(\theta)$ is any
function $m_t^*(u_t^*, \theta)$ of the model parameters, variables in $\Omega_t$, and a random
variable $u_t^*$, which either has a known distribution or can be simulated, such
that, for all $\theta$ in the parameter space, $E(m_t^*(u_t^*, \theta)) = m_t(\theta)$. To simplify
notation, we write $u_t^*$ as a scalar random variable, but it may well be a vector
of random variables in practical situations of interest.


The conceptually simplest unbiased simulator can be implemented as follows.
For given $\theta$, we obtain $S$ simulated values $y_{ts}^*$ of the observed variable under
the DGP characterized by $\theta$, making use of $S$ random numbers $u_{ts}^*$. Then
we let $m_t^*(u_{ts}^*, \theta) = h_t(y_{ts}^*)$. If (9.93) is indeed a zero function, then $h_t(y_{ts}^*)$
must have expectation $m_t(\theta)$, and it is obvious that the sample mean of the
simulated values $h_t(y_{ts}^*)$ is a simulation-based estimate of that expectation.
This simple simulator, which is applicable whether or not the model involves
any latent variables, is not the only possible simulator, and it may not be the
most desirable one for some purposes. However, we will not consider more
complicated simulators in this book.


If an unbiased simulator is available, the elementary zero functions (9.93) can 
be replaced by the functions 


$$f_t^*(y_t, \theta) = h_t(y_t) - \frac{1}{S}\sum_{s=1}^{S} m_t^*(u_{ts}^*, \theta), \tag{9.94}$$


where the $u_{ts}^*$, $t = 1, \ldots, n$, $s = 1, \ldots, S$, are mutually independent draws.
Since these draws are computer generated, they are evidently independent of
the $y_t$. The functions (9.94) are legitimate elementary zero functions, even
in the trivial case in which $S = 1$. If the true DGP is characterized by $\theta_0$,
then $E(h_t(y_t)) = m_t(\theta_0)$ by definition, and $E(m_t^*(u_{ts}^*, \theta_0)) = m_t(\theta_0)$ for all $s$
by construction. It follows that the expectation of (9.94) is zero for $\theta = \theta_0$, but
not in general for other values of $\theta$.


The application of GMM to the zero functions (9.94) is called the method of
simulated moments, or MSM. We can use an $n \times l$ matrix $W$ of appropriate
instruments, with $l \ge k$, in order to form the empirical moments


$$W^\top f^*(\theta), \tag{9.95}$$


in which the $n$-vector of functions $f^*(\theta)$ has typical element $f_t^*(y_t, \theta)$. A
GMM estimator that is efficient relative to this set of empirical moments may
be obtained by minimizing the quadratic form


$$Q(\theta, y) = \frac{1}{n}\, f^{*\top}(\theta)\, W \hat\Sigma^{-1} W^\top f^*(\theta) \tag{9.96}$$


with respect to $\theta$, where $\hat\Sigma$ consistently estimates the covariance matrix of
$n^{-1/2} W^\top f^*(\theta_0)$.

Minimizing the criterion function (9.96) with respect to $\theta$ proceeds in the
usual way, with one important proviso. Each evaluation of $f^*(\theta)$ requires a
large number of pseudo-random numbers (generally, at least $nS$ of them). It
is absolutely essential that the same set of random numbers be used every
time $f^*(\theta)$ is evaluated for a new value of the parameter vector $\theta$. Otherwise,
(9.96) would change not only as a result of changes in $\theta$ but also as a result
of changes in the random numbers used for the simulation. Therefore, if
the algorithm happened to evaluate the criterion function twice at the same
parameter vector, it would obtain two different values of $Q(\theta, y)$, and it could
not possibly tell where the minimum was located.
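The proviso about reusing random numbers can be illustrated with a toy criterion function. The sketch below is not the general MSM criterion: it assumes a single parameter, the simulator $m^*(u, \theta) = \exp(\theta + u)$ with standard normal $u$, the instrument $W = \iota$, and a weighting matrix equal to 1. The helper names `make_draws` and `criterion` are hypothetical.

```python
import math
import random

def make_draws(n, S, seed):
    """Return an n x S table of standard normal draws u*_{ts}."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(S)] for _ in range(n)]

def criterion(theta, y, draws):
    """Toy version of (9.96): (1/n) * (sum_t f*_t)^2, with W = iota, weight 1."""
    n = len(y)
    total = 0.0
    for y_t, u_t in zip(y, draws):
        sim_mean = sum(math.exp(theta + u) for u in u_t) / len(u_t)
        total += y_t - sim_mean  # f*_t = h_t(y_t) - (1/S) sum_s m*(u*_{ts}, theta)
    return total * total / n

data_rng = random.Random(1)
y = [math.exp(0.5 + data_rng.gauss(0.0, 1.0)) for _ in range(200)]

fixed = make_draws(len(y), S=10, seed=42)  # drawn once, reused at every theta
q_fixed_1 = criterion(0.4, y, fixed)
q_fixed_2 = criterion(0.4, y, fixed)       # identical to q_fixed_1

# Redrawing the random numbers changes the criterion at the same theta.
q_fresh_1 = criterion(0.4, y, make_draws(len(y), 10, seed=7))
q_fresh_2 = criterion(0.4, y, make_draws(len(y), 10, seed=8))
```

With the fixed draws the criterion is a deterministic function of $\theta$, so a numerical minimizer can work with it; with fresh draws at each call it is not.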


The details of the simulations will, of course, differ from case to case. An 
important point is that, since we require a fully specified DGP in order to 
generate the simulated data, it will generally be necessary to make stronger 
distributional assumptions for the purposes of MSM estimation than for the 
purposes of GMM estimation. 


The Asymptotic Distribution of the MSM Estimator 


Because the criterion function (9.96) is based on genuine zero functions, the
estimator $\hat\theta_{\mathrm{MSM}}$ obtained by minimizing it will be consistent whenever the
parameters are identified. However, as we will see in a moment, using simulated
quantities does affect the asymptotic covariance matrix of the estimator,
although the effect is generally very small if $S$ is a reasonably large number.


The first-order conditions for minimizing (9.96), ignoring a factor of $2/n$, are


$$F^{*\top}(\theta)\, W \hat\Sigma^{-1} W^\top f^*(\theta) = 0, \tag{9.97}$$


where $F^*(\theta)$ is the $n \times k$ matrix of which the $ti$th element is $\partial f_t^*(y_t, \theta)/\partial\theta_i$.
The solution to these equations is $\hat\theta_{\mathrm{MSM}}$. Although conditions (9.97) look
very similar to conditions (9.79), the covariance matrix is, in general, a good
deal more complicated.


From (9.97), it can be seen that the instruments effectively used by the MSM
estimator are $W \Sigma^{-1}(n^{-1} W^\top F_0^*)$, where $F_0^* \equiv F^*(\theta_0)$, and a factor of $n^{-1}$
has been used to keep the expression of order unity as $n \to \infty$. If we think of
the effective instruments as $Z = WJ$, then $J = \Sigma^{-1}(n^{-1} W^\top F_0^*)$.


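The asymptotic covariance matrix of $n^{1/2}(\hat\theta_{\mathrm{MSM}} - \theta_0)$ can now be found by
using the general formula (9.78) for the asymptotic covariance matrix of an
efficient GMM estimator with unknown covariance matrix. This is a sandwich
estimator of the form $A^{-1} B A^{-1}$, and we find that


$$\begin{aligned}
A &= \operatorname*{plim}_{n\to\infty}\, (n^{-1} F_0^{*\top} W)\, \Sigma^{-1} (n^{-1} W^\top F_0^*), \text{ and} \\
B &= \operatorname*{plim}_{n\to\infty}\, (n^{-1} F_0^{*\top} W)\, \Sigma^{-1} (n^{-1} W^\top \Omega W)\, \Sigma^{-1} (n^{-1} W^\top F_0^*),
\end{aligned} \tag{9.98}$$


where $\Omega$ is the $n \times n$ covariance matrix of $f^*(\theta_0)$.
The $ti$th element of $F^*(\theta)$ is, from (9.94),


$$-\frac{1}{S}\sum_{s=1}^{S} \frac{\partial m_t^*(u_{ts}^*, \theta)}{\partial\theta_i}.$$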


If $m_t^*$ is differentiable with respect to $\theta$ in a neighborhood of $\theta_0$, then we can
differentiate the relation $E(m_t^*(u_t^*, \theta)) = m_t(\theta)$ to find that


$$E\!\left(\frac{\partial m_t^*(u_t^*, \theta)}{\partial\theta_i}\right) = \frac{\partial m_t(\theta)}{\partial\theta_i}.$$


We denote by $M(\theta)$ the $n \times k$ matrix with typical element $\partial m_t(\theta)/\partial\theta_i$. By
a law of large numbers, we then see that $\operatorname{plim} n^{-1} W^\top F_0^* = -\operatorname{plim} n^{-1} W^\top M_0$,
where $M_0 \equiv M(\theta_0)$.

Consider next the covariance matrix $\Omega$ of $f^*(\theta_0)$. The original data $y_t$ are of
course completely independent of the simulated $u_{ts}^*$, and the simulated data
are independent across simulations. Thus, from (9.94), we see that


$$\Omega = \operatorname{Var}(h(y)) + \frac{1}{S}\operatorname{Var}(m^*(\theta_0)), \tag{9.99}$$


where $h(y)$ and $m^*(\theta)$ are the $n$-vectors with typical elements $h_t(y_t)$ and
$m_t^*(u_t^*, \theta)$, respectively. We see that the covariance matrix $\Omega$ has two
components, one due to the randomness of the data and the other due to the
randomness of the simulations. If the simulator $m_t^*(\cdot)$ is the simple one
suggested above, then the simulated data $h_t(y_t^*)$ are generated from the DGP
characterized by $\theta$, which is also supposed to have generated the real data.
Therefore, at $\theta = \theta_0$, it is clear that $\operatorname{Var}(h(y)) = \operatorname{Var}(m^*(\theta_0))$, and we conclude that
$\Omega = (1 + 1/S)\operatorname{Var}(h(y))$.
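As a worked step, using only the independence of the data from the simulations and of the $S$ draws from each other, the two components of the variance arise observation by observation as follows:

$$\operatorname{Var}\bigl(f_t^*(y_t, \theta_0)\bigr) = \operatorname{Var}\bigl(h_t(y_t)\bigr) + \frac{1}{S^2}\sum_{s=1}^{S}\operatorname{Var}\bigl(m_t^*(u_{ts}^*, \theta_0)\bigr) = \operatorname{Var}\bigl(h_t(y_t)\bigr) + \frac{1}{S}\operatorname{Var}\bigl(m_t^*(u_t^*, \theta_0)\bigr).$$

With $S = 10$, for instance, the simple simulator inflates the variance of each zero function by 10 percent relative to the case of no simulation.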

In general, the $n \times n$ matrix $\Omega$ cannot be estimated consistently, but an
HCCME or HAC estimator can be used to provide a consistent estimate of $\Sigma$,
the covariance matrix of $n^{-1/2} W^\top f^*(\theta_0)$. For the simple simulator we have
been discussing, $\hat\Sigma$ will just be $1 + 1/S$ times whatever HAC estimator or
HCCME would be appropriate if there were no simulation involved. For
other simulators, it may be a little harder to estimate (9.99). In any case,
once $\hat\Sigma$ is available, we use it to replace $n^{-1} W^\top \Omega W$ in (9.98). We also
replace $\operatorname{plim} n^{-1} W^\top F_0^*$ by $-\operatorname{plim} n^{-1} W^\top M_0$, whose sign cancels in the
quadratic forms of (9.98). The sandwich estimator for the
asymptotic covariance matrix then simplifies greatly, and we find that the
asymptotic covariance matrix is just


$$\operatorname*{plim}_{n\to\infty}\, \bigl((n^{-1} M_0^\top W)\, \Sigma^{-1} (n^{-1} W^\top M_0)\bigr)^{-1}.$$

In practice, $M_0$ can be estimated using either analytical or numerical derivatives
of $(1/S)\sum_{s=1}^{S} m_t^*(u_{ts}^*, \theta)$, evaluated at $\hat\theta_{\mathrm{MSM}}$. However, for this to be a
reliable estimator, it is necessary for $S$ to be reasonably large. If we let $\hat M$
denote the estimate of $M_0$, then in practice we will use


$$\widehat{\operatorname{Var}}(\hat\theta_{\mathrm{MSM}}) = n\bigl(\hat M^\top W \hat\Sigma^{-1} W^\top \hat M\bigr)^{-1}. \tag{9.100}$$


Notice that (9.100) has essentially the same form as (9.41) and (9.81), the
estimated covariance matrices for the feasible efficient GMM estimators of linear
regression and general nonlinear models, respectively. The most important
new feature of (9.100) is the factor of $1 + 1/S$, which is buried in $\hat\Sigma$.


Copyright © 1999, Russell Davidson and James G. MacKinnon 




The Lognormal Distribution: An Example 


Since the implementation of MSM estimation typically involves several steps 
and can be rather tricky, we now work through a simple example in detail. 
The example is in fact sufficiently simple that there is no need for simulation 
at all; we can work out the “right answer” directly. This provides a benchmark 
with which to compare the various other estimators that we consider. In order 
to motivate these other estimators, we demonstrate how GMM can be used to 
match moments of distributions. Moment matching can be done quite easily 
when the moments to be matched can be expressed analytically as functions 
of the parameters to be estimated, and no simulation is needed in such cases. 
If analytic expressions are not available, moment matching can still be done 
whenever we can simulate the random variables of which the expectations are 
the moments to be matched. 


A random variable is said to follow the lognormal distribution if its logarithm
is normally distributed. The lognormal distribution for a scalar random variable
$y$ thus depends on just two parameters, the expectation and the variance
of $\log y$. Formally, if $z \sim N(\mu, \sigma^2)$, then the variable $y = \exp(z)$ is lognormally
distributed, with a distribution characterized by $\mu$ and $\sigma^2$.


Suppose we have an $n$-vector $y$, of which the components $y_t$ are IID, each
lognormally distributed with unknown parameters $\mu$ and $\sigma^2$. The “right” way
to estimate these unknown parameters is to take logs of each component of $y$,
thus obtaining an $n$-vector $z$ with typical element $z_t$, and then to estimate $\mu$
and $\sigma^2$ by the sample mean and sample variance of the $z_t$. This can be done
by regressing $z$ on a constant.


The above estimation method implicitly matches the first and second moments 
of the log of y; in order to estimate the parameters. It yields the parameter 
values that give theoretical moments equal to the corresponding moments 
in the sample. Since we have two parameters to estimate, we need at least 
two moments. But other sets of two moments could also be used in order to 
obtain method of moments estimators of $\mu$ and $\sigma^2$. So could sets of more than
two moments, although the match could not be perfect, because there would 
implicitly be overidentifying restrictions. 


We now consider precisely how we might estimate $\mu$ and $\sigma^2$ by matching the
first moment of the $y_t$ along with the first moment of the $z_t$. With this choice,
it is once more possible to obtain an analytical answer, because, as the result
of Exercise 9.19 shows, the expectation of $y_t$ is $\exp(\mu + \tfrac12\sigma^2)$. Thus, as before,
we estimate $\mu$ by using $\bar z$, the sample mean of the $z_t$, and then estimate $\sigma^2$
by solving the equation


$$\log \bar y = \bar z + \tfrac12\hat\sigma^2$$


for $\hat\sigma^2$, where $\bar y$ is the sample mean of the $y_t$. The estimate is


$$\hat\sigma^2 = 2(\log \bar y - \bar z). \tag{9.101}$$


This estimate will not, except by random accident, be numerically equal to
the estimate obtained by regressing $z$ on a constant, and in fact it has a higher
variance; see Exercises 9.20 and 9.21.
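The two estimators of $\sigma^2$ can be compared on artificial data. The following is a minimal sketch; the true values $\mu_0 = 1$ and $\sigma_0 = 0.5$, the sample size, and the seed are arbitrary choices made for illustration, not values from the text.

```python
import math
import random
import statistics

rng = random.Random(123)
mu0, sigma0 = 1.0, 0.5  # assumed true parameter values (illustrative only)
z = [rng.gauss(mu0, sigma0) for _ in range(5000)]
y = [math.exp(z_t) for z_t in z]  # lognormal data

# "Right" way: sample mean and sample variance of z = log y.
mu_hat = statistics.fmean(z)
sigma2_logs = statistics.variance(z, mu_hat)

# Moment matching as in (9.101): match E(z) and E(y) = exp(mu + sigma^2 / 2).
sigma2_mm = 2.0 * (math.log(statistics.fmean(y)) - mu_hat)
```

Both estimates should be close to $\sigma_0^2 = 0.25$ in a sample of this size, but they will not coincide, which is the point made in the text.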


Let us formalize the estimation procedure described above in terms of zero
functions and GMM. The moments used are the first moments of the $y_t$ and
the $z_t$, for $t = 1, \ldots, n$. For each observation, then, there are two elementary
zero functions, which serve to express the expectations of the $y_t$ and the $z_t$ in
terms of the parameters $\mu$ and $\sigma^2$. We write these elementary zero functions
as follows:


$$f_{t1}(z_t, \mu, \sigma^2) = z_t - \mu; \qquad f_{t2}(y_t, \mu, \sigma^2) = y_t - \exp(\mu + \tfrac12\sigma^2). \tag{9.102}$$


The derivatives of these functions with respect to the parameters are


$$\frac{\partial f_{t1}}{\partial\mu} = -1, \quad \frac{\partial f_{t1}}{\partial\sigma^2} = 0, \quad \frac{\partial f_{t2}}{\partial\mu} = -e^{\mu + \sigma^2/2}, \quad \frac{\partial f_{t2}}{\partial\sigma^2} = -\tfrac12 e^{\mu + \sigma^2/2}. \tag{9.103}$$


These derivatives, which are all deterministic, allow us to find the optimal
instruments for the estimation of $\mu$ and $\sigma^2$ on the basis of the zero
functions (9.102), provided that we can also obtain the covariance matrix $\Omega$ of
the zero functions.


Notice that, in contrast to many GMM estimation procedures, this one
involves two elementary zero functions and no instruments. Nevertheless, we
can set the problem up so that it looks like a standard one. Let $f_1(\mu, \sigma^2)$
and $f_2(\mu, \sigma^2)$ be two $n$-vectors with typical components $f_{t1}(z_t, \mu, \sigma^2)$ and
$f_{t2}(y_t, \mu, \sigma^2)$, respectively. For notational simplicity, we suppress the explicit
dependence of these vectors on the $y_t$ and the $z_t$. The $2n$-vector $f(\mu, \sigma^2)$ of
the full set of elementary zero functions, and the $2n \times 2$ matrix $F(\mu, \sigma^2)$ of
the derivatives with respect to the parameters, can thus be written as


$$f(\mu, \sigma^2) = \begin{bmatrix} f_1(\mu, \sigma^2) \\ f_2(\mu, \sigma^2) \end{bmatrix} \quad\text{and}\quad F(\mu, \sigma^2) = -\begin{bmatrix} \iota & 0 \\ a\iota & \tfrac12 a\iota \end{bmatrix}, \tag{9.104}$$


where $a \equiv \exp(\mu + \tfrac12\sigma^2)$. The constant vectors $\iota$ in $F(\mu, \sigma^2)$ arise because
none of the derivatives in (9.103) depends on $t$, which is a consequence of the
assumption that the data are IID.


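Because $f(\mu, \sigma^2)$ is a $2n$-vector, the covariance matrix $\Omega$ is $2n \times 2n$. This
matrix can be written as


$$\Omega = E\!\left(\begin{bmatrix} f_{10} \\ f_{20} \end{bmatrix}\begin{bmatrix} f_{10}^\top & f_{20}^\top \end{bmatrix}\right),$$


where $f_{i0}$, $i = 1, 2$, is $f_i$ evaluated at the true values $\mu_0$ and $\sigma_0^2$. Since the
data are IID, $\Omega$ can be partitioned as follows into four $n \times n$ blocks, each of
which is proportional to an identity matrix. The result is


$$\Omega = \begin{bmatrix} \sigma_z^2 I & \sigma_{yz} I \\ \sigma_{yz} I & \sigma_y^2 I \end{bmatrix}, \tag{9.105}$$


where $\sigma_y^2 \equiv \operatorname{Var}(y_t)$, $\sigma_z^2 \equiv \operatorname{Var}(z_t)$, and $\sigma_{yz} \equiv \operatorname{Cov}(y_t, z_t)$.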


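We now have everything we need to set up the efficient estimating equations
(9.82), which, ignoring the factor of $n^{-1}$, become


$$F^\top(\mu, \sigma^2)\, \Omega^{-1} f(\mu, \sigma^2) = 0, \tag{9.106}$$


where $f(\cdot)$ and $F(\cdot)$ are given by (9.104), and $\Omega$ is given by (9.105). By
explicitly performing the multiplications of partitioned matrices in (9.106),
inverting $\Omega$, and ignoring irrelevant scalar factors, we obtain


$$\begin{bmatrix} \sigma_y^2 - a\sigma_{yz} & a\sigma_z^2 - \sigma_{yz} \\ -\tfrac12 a\sigma_{yz} & \tfrac12 a\sigma_z^2 \end{bmatrix}
\begin{bmatrix} \iota^\top & 0^\top \\ 0^\top & \iota^\top \end{bmatrix}
\begin{bmatrix} f_1(\mu, \sigma^2) \\ f_2(\mu, \sigma^2) \end{bmatrix} = 0.$$


Since the leftmost factor above is a $2 \times 2$ nonsingular matrix, we see that these
estimating equations are equivalent to


$$\iota^\top f_1(\mu, \sigma^2) = 0 \quad\text{and}\quad \iota^\top f_2(\mu, \sigma^2) = 0. \tag{9.107}$$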


The solution to these two equations is $\hat\mu = \bar z$ and $\hat\sigma^2$ given by (9.101).
Curiously, it appears that the explicit expressions for $F(\cdot)$ and $\Omega$ are not needed
in order to formulate the estimator. They are needed, however, for the
evaluation of expression (9.67) for its asymptotic covariance matrix. This is left as
an exercise for the reader; in particular, the same expression for the variance
of $\hat\sigma^2$ should be found as in the answer to Exercise 9.21.


As we mentioned above, it is possible to use more than two moments. Suppose
that, in addition to matching the first moments of the $z_t$ and the $y_t$, we also
wish to match the second moment of the $y_t$, or, equivalently, the first moment
of the $y_t^2$. Since the log of $y_t^2$ is just $2z_t$, which is distributed as $N(2\mu, 4\sigma^2)$,
the expectation of $y_t^2$ is $\exp(2(\mu + \sigma^2))$. We now have three elementary zero
functions for each observation, the two given in (9.102) and


$$f_{t3}(y_t, \mu, \sigma^2) = y_t^2 - \exp\bigl(2(\mu + \sigma^2)\bigr).$$


The vector $f(\cdot)$ and the matrix $F(\cdot)$, originally defined in (9.104), now both
have $3n$ rows. The latter still has two columns, both of which can be
partitioned into three $n$-vectors, each proportional to $\iota$. Further, the matrix $\Omega$
of (9.105) grows to become a $3n \times 3n$ matrix. It is then a matter of taste
whether to set up a just identified estimation problem using as optimal
instruments the two columns of $\Omega^{-1}F(\mu, \sigma^2)$, or to use three instruments, which
will be the columns of the matrix


$$W = \begin{bmatrix} \iota & 0 & 0 \\ 0 & \iota & 0 \\ 0 & 0 & \iota \end{bmatrix}, \tag{9.108}$$


and to construct an optimal weighting matrix. Whichever choice is made, it
is necessary to estimate $\Omega$ in order to construct the optimal instruments for
the first method, or the optimal weighting matrix for the second.


The procedures we have just described depend on the fact that we know the
analytic forms of $E(z_t)$, $E(y_t)$, and $E(y_t^2)$. In more complicated applications,
comparable analytic expressions for the moments to be matched might not be
available; see Exercise 9.24 for an example. In such cases, simulators can be
used to replace such analytic expressions. We illustrate the method for the
case of the lognormal distribution, matching the first moments of $z_t$ and $y_t$,
pretending that we do not know the analytic expressions for their expectations.


For any given values of $\mu$ and $\sigma^2$, we can draw from the lognormal distribution
characterized by these values by first using a random number generator to
give a drawing $u^*$ from $N(0, 1)$ and then computing $y^* = \exp(\mu + \sigma u^*)$. Thus
unbiased simulators for the expectations of $z = \log y$ and of $y$ itself are


$$m_1^*(u^*, \mu, \sigma^2) = \mu + \sigma u^* \quad\text{and}\quad m_2^*(u^*, \mu, \sigma^2) = \exp(\mu + \sigma u^*).$$


If we perform $S$ simulations, the zero functions for MSM estimation can be
written as


$$f_{t1}^*(z_t, \mu, \sigma^2) = z_t - \frac{1}{S}\sum_{s=1}^{S} m_1^*(u_{ts}^*, \mu, \sigma^2) \quad\text{and}$$

$$f_{t2}^*(y_t, \mu, \sigma^2) = y_t - \frac{1}{S}\sum_{s=1}^{S} m_2^*(u_{ts}^*, \mu, \sigma^2),$$


where the $u_{ts}^*$ are IID standard normal. Comparison with (9.102) shows clearly
how we replace analytic expressions for the moments, assumed to be unknown,
by simulation-based estimates.


Since the data are IID, it might appear tempting to use just one set of random
numbers, $u_s^*$, $s = 1, \ldots, S$, for all $t$. However, doing this would introduce
dependence among the zero functions, greatly complicating the computation 
of their covariance matrix. As S becomes large, of course, the law of large 
numbers ensures that this effect becomes less and less important. Using just 
one set of random numbers would in any case not affect the consistency of the 
MSM estimator, merely that of the covariance matrix estimate. 


By analogy with (9.107), we can see that the MSM estimating equations are


$$\iota^\top f_1^*(\hat\mu, \hat\sigma^2) = 0 \quad\text{and}\quad \iota^\top f_2^*(\hat\mu, \hat\sigma^2) = 0. \tag{9.109}$$


Here we have again grouped the elementary zero functions into two $n$-vectors
$f_1^*(\cdot)$ and $f_2^*(\cdot)$. Recalling that the random numbers $u_{ts}^*$ are drawn only once
for the entire procedure, let us make the definitions


$$\begin{aligned}
m_{t1}(\mu, \sigma^2) &\equiv \frac{1}{S}\sum_{s=1}^{S} m_1^*(u_{ts}^*, \mu, \sigma^2) = \mu + \sigma\,\frac{1}{S}\sum_{s=1}^{S} u_{ts}^*, \text{ and} \\
m_{t2}(\mu, \sigma^2) &\equiv \frac{1}{S}\sum_{s=1}^{S} m_2^*(u_{ts}^*, \mu, \sigma^2) = \frac{1}{S}\sum_{s=1}^{S} \exp(\mu + \sigma u_{ts}^*).
\end{aligned} \tag{9.110}$$


It is clear that, as $S \to \infty$, these functions tend for all $t$ to the
expectations of $z$ and $y$, respectively. It is also not hard to see that these
limits are $\mu$ and $\exp(\mu + \tfrac12\sigma^2)$.


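On dividing by the sample size $n$ and rearranging, the estimating
equations (9.109) can be written as


$$\bar m_1(\mu, \sigma^2) = \bar z \quad\text{and}\quad \bar m_2(\mu, \sigma^2) = \bar y, \tag{9.111}$$


where $\bar z$ and $\bar y$ are the sample averages of the $z_t$ and the $y_t$, respectively, and


$$\bar m_i(\mu, \sigma^2) \equiv \frac{1}{n}\sum_{t=1}^{n} m_{ti}(\mu, \sigma^2), \qquad i = 1, 2.$$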


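Equations (9.111) can be solved in various ways. One approach is to turn the
problem of solving them into a minimization problem. Let


$$W = \begin{bmatrix} \iota & 0 \\ 0 & \iota \end{bmatrix}. \tag{9.112}$$


Then it is not difficult to see that minimizing the quadratic form


$$\begin{bmatrix} z - m_1(\mu, \sigma^2) \\ y - m_2(\mu, \sigma^2) \end{bmatrix}^{\!\top} W W^\top \begin{bmatrix} z - m_1(\mu, \sigma^2) \\ y - m_2(\mu, \sigma^2) \end{bmatrix} \tag{9.113}$$


will also solve equations (9.111); see Exercise 9.23. Here the $n$-vectors $m_1(\cdot)$
and $m_2(\cdot)$ have typical elements $m_{t1}(\cdot)$ and $m_{t2}(\cdot)$, respectively.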


Alternatively, we can use Newton’s Method directly. We discussed this
procedure in Section 6.4, in connection with minimizing a nonlinear function, but it
can also be applied to sets of equations like (9.111). Suppose that we wish to
solve a set of $k$ equations of the form $g(\theta) = 0$ for a $k$-vector of unknowns $\theta$,
where $g(\cdot)$ is also a $k$-vector. The iterative step analogous to (6.43) is


$$\theta_{(j+1)} = \theta_{(j)} - G^{-1}(\theta_{(j)})\, g(\theta_{(j)}), \tag{9.114}$$


where $G(\theta)$ is the Jacobian matrix associated with $g(\theta)$. This $k \times k$ matrix
contains the derivatives of the components of $g(\theta)$ with respect to the elements
of $\theta$. For the estimating equations (9.111), the iterative step (9.114) becomes


$$\begin{bmatrix} \mu_{(j+1)} \\ \sigma^2_{(j+1)} \end{bmatrix} = \begin{bmatrix} \mu_{(j)} \\ \sigma^2_{(j)} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial \bar m_1}{\partial\mu} & \dfrac{\partial \bar m_1}{\partial\sigma^2} \\[2ex] \dfrac{\partial \bar m_2}{\partial\mu} & \dfrac{\partial \bar m_2}{\partial\sigma^2} \end{bmatrix}^{-1} \begin{bmatrix} \bar m_1(\mu_{(j)}, \sigma^2_{(j)}) - \bar z \\ \bar m_2(\mu_{(j)}, \sigma^2_{(j)}) - \bar y \end{bmatrix},$$


where all the partial derivatives are evaluated at $(\mu_{(j)}, \sigma^2_{(j)})$. It should be
noted that these partial derivatives are known analytically, as they can be
calculated directly from (9.110).
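The Newton iteration for equations (9.111) can be sketched as follows. This is an illustrative implementation under several assumptions that are not in the text: the function name `msm_lognormal`, the starting values ($\mu = \bar z$, $\sigma^2 = 1$), the convergence tolerance, the floor on $\sigma^2$, and the artificial data are all arbitrary. Because $\bar m_1$ and $\bar m_2$ average over both $t$ and $s$, the $nS$ draws are stored in a single flat list.

```python
import math
import random

def msm_lognormal(y, S=20, seed=0, max_iter=50, tol=1e-10):
    """Solve (9.111) for (mu, sigma^2) by Newton's Method, as in (9.114)."""
    n = len(y)
    z = [math.log(v) for v in y]
    z_bar, y_bar = sum(z) / n, sum(y) / n
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n * S)]  # drawn once, reused throughout
    N = len(u)
    u_bar = sum(u) / N

    def gap_and_jacobian(mu, s2):
        sig = math.sqrt(s2)
        e = [math.exp(mu + sig * v) for v in u]
        m1 = mu + sig * u_bar                 # mean over t of m_t1 in (9.110)
        m2 = sum(e) / N                       # mean over t of m_t2 in (9.110)
        g = (m1 - z_bar, m2 - y_bar)
        # Analytic partial derivatives with respect to mu and sigma^2.
        G = ((1.0, u_bar / (2.0 * sig)),
             (m2, sum(ev * v for ev, v in zip(e, u)) / (2.0 * sig * N)))
        return g, G

    mu, s2 = z_bar, 1.0  # assumed starting values
    for _ in range(max_iter):
        g, G = gap_and_jacobian(mu, s2)
        if max(abs(g[0]), abs(g[1])) < tol:
            break
        det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
        # Newton step: theta <- theta - G^{-1} g, using the explicit 2x2 inverse.
        mu -= (G[1][1] * g[0] - G[0][1] * g[1]) / det
        s2 -= (-G[1][0] * g[0] + G[0][0] * g[1]) / det
        s2 = max(s2, 1e-8)  # keep sigma^2 positive (a safeguard, not in the text)
    return mu, s2, gap_and_jacobian(mu, s2)[0]

data_rng = random.Random(2024)
y = [math.exp(data_rng.gauss(1.0, 0.5)) for _ in range(500)]
mu_hat, sigma2_hat, gap = msm_lognormal(y)
```

The third returned value holds the two discrepancies $\bar m_1 - \bar z$ and $\bar m_2 - \bar y$ at the final iterate, which should be essentially zero if the iteration has converged.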


To estimate the asymptotic covariance matrix of the MSM estimates, we can
use any suitable estimator of (9.81), provided we remember to multiply the
result by $1 + 1/S$ in order to account for the simulation randomness. The
instrument matrix $W$ of (9.81) is just the matrix $W$ of (9.112). We are
pretending that we do not know the analytic form of the matrix $F(\mu, \sigma^2)$
given in (9.104), and so instead we use the matrix of partial derivatives of $m_1$
and $m_2$, evaluated at $\hat\mu$ and $\hat\sigma^2$. This matrix is


$$\hat F = \begin{bmatrix} \dfrac{\partial m_1}{\partial\mu}(\hat\mu, \hat\sigma^2) & \dfrac{\partial m_1}{\partial\sigma^2}(\hat\mu, \hat\sigma^2) \\[2ex] \dfrac{\partial m_2}{\partial\mu}(\hat\mu, \hat\sigma^2) & \dfrac{\partial m_2}{\partial\sigma^2}(\hat\mu, \hat\sigma^2) \end{bmatrix}; \tag{9.115}$$


note that each block in $\hat F$ is an $n$-vector. If we used Newton’s Method for the
estimation, then all the partial derivatives in this matrix will already have been
computed. Finally, the covariance matrix $\Omega$ of the elementary zero functions
can be estimated using (9.105), by replacing the unknown quantities $\sigma_z^2$, $\sigma_y^2$,
and $\sigma_{yz}$ with their sample analogs. If we denote the result of this by $\hat\Omega$, then
our estimate of the covariance matrix of $\hat\mu$ and $\hat\sigma^2$ is


$$\widehat{\operatorname{Var}}\begin{bmatrix} \hat\mu \\ \hat\sigma^2 \end{bmatrix} = (W^\top \hat F)^{-1}\, W^\top \hat\Omega W\, (\hat F^\top W)^{-1}, \tag{9.116}$$


with $W$ given by (9.112) and $\hat F$ given by (9.115).
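A sketch of the sandwich formula (9.116) for the lognormal example is given below. As a simplifying assumption, it uses the analytic derivatives from (9.104) in place of the simulated ones in (9.115), so that $W^\top \hat F = -nA$ with $A$ the $2 \times 2$ matrix with rows $(1, 0)$ and $(a, a/2)$, and $W^\top \hat\Omega W$ is $n$ times the $2 \times 2$ sample covariance matrix of $(z_t, y_t)$. The function name `lognormal_cov`, the optional argument `S`, and the artificial data are hypothetical.

```python
import math
import random
import statistics

def lognormal_cov(y, S=None):
    """Estimate Var(mu_hat, sigma2_hat) via (9.116); scale by 1 + 1/S if S given."""
    n = len(y)
    z = [math.log(v) for v in y]
    z_bar, y_bar = statistics.fmean(z), statistics.fmean(y)
    s2_hat = 2.0 * (math.log(y_bar) - z_bar)   # estimate (9.101)
    a = math.exp(z_bar + 0.5 * s2_hat)         # a = exp(mu + sigma^2 / 2)
    var_z = statistics.pvariance(z, z_bar)     # sample analogs for (9.105)
    var_y = statistics.pvariance(y, y_bar)
    cov_yz = sum((yt - y_bar) * (zt - z_bar) for yt, zt in zip(y, z)) / n
    # Rows of A^{-1}, where A = [[1, 0], [a, a/2]].
    rows = ((1.0, 0.0), (-2.0, 2.0 / a))

    def quad(p, q):
        # Element of (1/n) * A^{-1} [[var_z, cov_yz], [cov_yz, var_y]] A^{-T}.
        return (p[0] * q[0] * var_z
                + (p[0] * q[1] + p[1] * q[0]) * cov_yz
                + p[1] * q[1] * var_y) / n

    scale = 1.0 if S is None else 1.0 + 1.0 / S
    return [[scale * quad(p, q) for q in rows] for p in rows]

rng = random.Random(7)
y = [math.exp(rng.gauss(1.0, 0.5)) for _ in range(5000)]
cov_mm = lognormal_cov(y)          # no simulation involved
cov_msm = lognormal_cov(y, S=10)   # inflated by the factor 1 + 1/S
```

The only difference between the two results is the factor $1 + 1/S$, which is 1.1 here; with simulated derivatives, as in (9.115), the result would also depend on the particular random numbers drawn.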


MSM Estimation: Conclusion 


Although it is very special, the example of the previous subsection illustrates 
most of the key features of MSM estimation. The example shows how to 
estimate two parameters by using two or more elementary zero functions, 
even when there are no genuine instruments. In econometric applications, it 
is more common for there to be as many elementary zero functions as there 
are dependent variables, just one in the case of univariate models, and for 
there to be more instruments than parameters. Also, in many applications, 
the data will not be IID, but this complication generally does not require 
substantial changes to the methods illustrated above. 


Inference in models estimated by MSM is almost always based on asymptotic 
theory, and it may therefore be quite unreliable in finite samples. Since MSM 
estimation makes sense only when a model is too intractable for less compu- 
tationally demanding methods to be applicable, the cost of estimating such 
a model a large number of times, as would be needed to employ bootstrap 
methods, is likely to be prohibitive.


Not surprisingly, the literature on MSM is relatively recent. The two classic
papers are McFadden (1989), who seems to have coined the name, and Pakes 
and Pollard (1989). Other important early papers include Lee and Ingram 
(1991), Keane (1994), McFadden and Ruud (1994), and Gallant and Tauchen 
(1996). An interesting early application of the method is Duffie and Singleton 
(1993). Useful references include Hajivassiliou and Ruud (1994), Gouriéroux 
and Monfort (1996), and van Dijk, Monfort, and Brown (1995), which is a 
collection of papers, both theoretical and applied. 


9.7 Final Remarks 


As its name implies, the generalized method of moments is a very general 
estimation method indeed, and numerous other methods can be thought of 
as special cases. These include all of the ones we have discussed so far: MM, 
OLS, NLS, GLS, and IV. Thus the number of techniques that can legitimately 
be given the label “GMM” is bewilderingly large. To avoid bewilderment, it 
is best not to attempt to enumerate all the possibilities, but simply to list 
some of the ways in which various GMM estimators differ: 


• Methods for which the explanatory variables are exogenous or predetermined
(including OLS, NLS, and GLS), and for which no extra instruments
are required, versus methods that do require additional exogenous
or predetermined instruments (including linear and nonlinear IV).


• Methods for linear models (including OLS, GLS, linear IV, and the GMM
techniques discussed in Section 9.2) versus methods for nonlinear models
(including NLS, GNLS, nonlinear IV, and the GMM techniques discussed
in Section 9.5).


• Methods that are inefficient for a given set of moment conditions, which
will have sandwich covariance matrices, versus methods that are efficient
for the same set of moment conditions, which will not.


• Methods that are fully efficient, because they are based on optimal
instruments, versus methods that are not fully efficient.


• Methods based on a covariance matrix that is known, at least up to a
finite number of parameters which can be estimated consistently, versus
methods that require an HCCME or a HAC estimator. The latter can
never be fully efficient.


• Methods that involve simulation, such as MSM, versus methods where
the criterion function can be evaluated analytically.


• Univariate models versus multivariate models. We have not yet discussed
any methods for estimating the latter, but we will do so in Chapter 12.


9.8 Exercises


9.1 


9.2 


9.3 


9.4 


9.5 


Show that the difference between the matrix


$$(J^\top W^\top X)^{-1} J^\top W^\top \Omega W J\, (X^\top W J)^{-1}$$


and the matrix


$$\bigl(X^\top W (W^\top \Omega W)^{-1} W^\top X\bigr)^{-1}$$


is a positive semidefinite matrix. Hints: Recall Exercise 3.8. Express the
second of the two matrices in terms of the projection matrix $P_{\Omega^{1/2}W}$, and
then find a similar projection matrix for the first of them.


Let the $n$-vector $u$ be such that $E(u) = 0$ and $E(uu^\top) = I$, and let the $n \times l$
matrix $W$ be such that $E(W_t u_t) = 0$ and that $E(u_t u_s \mid W_t, W_s) = \delta_{ts}$, where
$\delta_{ts}$ is the Kronecker delta introduced in Section 1.4. Assume that $S_{W^\top W} \equiv
\operatorname{plim} n^{-1} W^\top W$ is finite, deterministic, and positive definite. Explain why
the quadratic form $u^\top P_W u$ must be asymptotically distributed as $\chi^2(l)$.


Consider the quadratic form $x^\top A x$, where $x$ is a $p \times 1$ vector and $A$ is a
$p \times p$ matrix, which may or may not be symmetric. Show that there exists a
symmetric $p \times p$ matrix $B$ such that $x^\top B x = x^\top A x$ for all $p \times 1$ vectors $x$,
and give the explicit form of a suitable $B$.


For the model (9.01) and a specific choice of the $l \times k$ matrix $J$, show that
minimizing the quadratic form (9.12) with weighting matrix $\Lambda = JJ^\top$ gives
the same estimator as solving the moment conditions (9.05) with the given $J$.
Assuming that these moment conditions have a unique solution for $\beta$, show
that the matrix $JJ^\top$ is of rank $k$, and hence positive semidefinite without
being positive definite.


Construct a symmetric, positive definite, $l \times l$ weighting matrix $\Lambda$ such that
minimizing (9.12) with this $\Lambda$ leads once more to the same estimator as that
given by solving conditions (9.05). It is convenient to take $\Lambda$ in the form
$JJ^\top + NN^\top$. In the construction of $N$, it may be useful to partition $W$ as
$[W_1 \;\; W_2]$, where the $n \times k$ matrix $W_1$ is such that $W_1^\top X$ is nonsingular.


Consider the linear regression model with serially correlated errors,


$$y_t = \beta_1 + \beta_2 x_t + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t, \tag{9.117}$$


where the $\varepsilon_t$ are IID, and the autoregressive parameter $\rho$ is assumed either
to be known or to be estimated consistently. The explanatory variable $x_t$ is
assumed to be contemporaneously correlated with $\varepsilon_t$ (see Section 8.4 for the
definition of contemporaneous correlation).


Recall from Chapter 7 that the covariance matrix $\Omega$ of the vector $u$ with
typical element $u_t$ is given by (7.32), and that $\Omega^{-1}$ can be expressed as $\Psi\Psi^\top$,
where $\Psi$ is defined in (7.59). Express the model (9.117) in the form (9.20),
without taking account of the first observation.


Let $\Omega_t$ be the information set for observation $t$ with $E(\varepsilon_t \mid \Omega_t) = 0$. Suppose
that there exists a matrix $Z$ of instrumental variables, with $Z_t \in \Omega_t$, such that
the explanatory vector $x$ with typical element $x_t$ is related to the instruments
by the equation


$$x = Z\pi + v, \tag{9.118}$$


9.7


9.8 


where $E(v_t \mid \Omega_t) = 0$. Derive the explicit form for the model (9.117) of the
expression $(\Psi^\top \bar X)_t$ defined implicitly by equation (9.24). Find a matrix $W$ of
instruments that satisfy the predeterminedness condition in the form (9.30)
and that lead to asymptotically efficient estimates of the parameters $\beta_1$ and $\beta_2$
computed on the basis of the theoretical moment conditions (9.31) with your
choice of $W$.


Consider the model (9.20), where the matrix $\Psi$ is chosen in such a way that
the transformed error terms, the $(\Psi^\top u)_t$, are innovations with respect to
the information sets $\Omega_t$. In other words, $E((\Psi^\top u)_t \mid \Omega_t) = 0$. Suppose that
the $n \times l$ matrix of instruments $W$ is predetermined in the usual sense that
$W_t \in \Omega_t$. Show that these assumptions, along with the assumption that
$E((\Psi^\top u)_t^2 \mid \Omega_t) = E((\Psi^\top u)_t^2) = 1$ for $t = 1, \ldots, n$, are enough to prove the
analog of (9.02), that is, that


$$\operatorname{Var}(n^{-1/2}\, W^\top \Psi^\top u) = n^{-1} E(W^\top W).$$


In order to perform just-identified estimation, let the $n \times k$ matrix $Z = WJ$,
for an $l \times k$ matrix $J$ of full column rank. Compute the asymptotic covariance
matrix of the estimator obtained by solving the moment conditions


$$Z^\top \Psi^\top (y - X\beta) = J^\top W^\top \Psi^\top (y - X\beta) = 0. \tag{9.119}$$


The covariance matrix you have found will be a sandwich. Find the choice
of $J$ that eliminates the sandwich, and show that this choice leads to an
asymptotic covariance matrix that is smaller, in the usual sense, than the
asymptotic covariance matrix for any other choice of $J$.


Compute the GMM criterion function for model (9.20) with instruments $W$,
and show that the estimator found by minimizing this criterion function is
just the estimator obtained using the optimal choice of $J$.


Compare the asymptotic covariance matrix found in the preceding question
for the estimator of the parameters of model (9.20), obtained by minimizing
the GMM criterion function for the $n \times l$ matrix of predetermined instruments $W$,
with the covariance matrix (9.29) that corresponds to estimation
with instruments $\Psi^\top \bar X$. In particular, show that the difference between the
two is a positive semidefinite matrix.


9.8 Consider overidentified estimation based on the moment conditions

$$E\bigl(W^\top \Omega^{-1} (y - X\beta)\bigr) = 0,$$

which were given in (9.31), where the $n \times l$ matrix of instruments $W$ satisfies
the predeterminedness condition (9.30). Derive the GMM criterion function
for these theoretical moment conditions, and show that the estimating equations
that result from the minimization of this criterion function are

$$X^\top \Omega^{-1} W (W^\top \Omega^{-1} W)^{-1} W^\top \Omega^{-1} (y - X\beta) = 0. \tag{9.120}$$


Suppose that $\mathcal{S}(\bar X)$, the span of the $n \times k$ matrix $\bar X$ of optimal instruments
defined by (9.24), is a linear subspace of $\mathcal{S}(W)$, the span of the transformed
instruments. Show that, in this case, the estimating equations (9.120) are
asymptotically equivalent to

$$\bar X^\top \Omega^{-1} (y - X\beta) = 0,$$

of which the solution is the efficient estimator $\hat\beta_{\mathrm{EGMM}}$ defined in (9.26).


Show that the asymptotic covariance matrix of the estimator obtained by
solving the estimating equations (9.120) is

$$\operatorname*{plim}_{n \to \infty} \Bigl(\frac{1}{n}\, X^\top \Omega^{-1} W (W^\top \Omega^{-1} W)^{-1} W^\top \Omega^{-1} X\Bigr)^{-1}. \tag{9.121}$$

By expressing this asymptotic covariance matrix in terms of a matrix $\Psi$ that
satisfies the equation $\Omega^{-1} = \Psi\Psi^\top$, show that the difference between it and
the asymptotic covariance matrix of the efficient estimator $\hat\beta_{\mathrm{EGMM}}$ of (9.26)
is a positive semidefinite matrix.


9.9 Give the explicit form of the $n \times n$ matrix $U(j)$ for which $\hat\Gamma(j)$, defined
in (9.36), takes the form $n^{-1} W^\top U(j) W$.


9.10 This question uses data on daily returns for the period 1989-1998 from the
file daily-crsp.data. These data are made available by courtesy of the Center
for Research in Security Prices (CRSP); see the comments at the bottom of
the file. Let $r_t$ denote the daily return on shares of Mobil Corporation, and
let $v_t$ denote the daily return for the CRSP value-weighted index. Using all
but the first four observations (to allow for lags), run the regression

$$r_t = \beta_1 + \beta_2 v_t + u_t$$


by OLS. Report three different sets of standard errors: the usual OLS ones, 
ones based on the simplest HCCME, and ones based on a more advanced 
HCCME that corrects for the downward bias in the squared OLS residuals; 
see Section 5.5. Do the OLS standard errors appear to be reliable? 


Assuming that the $u_t$ are heteroskedastic but serially uncorrelated, obtain
estimates of the $\beta_i$ that are more efficient than the OLS ones. For this purpose,
use r2, ve, v, and vo as additional instruments. Do these estimates 
appear to be more efficient than the OLS ones? 


9.11 Using the data for consumption ($C_t$) and disposable income ($Y_t$) contained in
the file consumption.data, construct the variables $c_t = \log C_t$, $\Delta c_t = c_t - c_{t-1}$,
$y_t = \log Y_t$, and $\Delta y_t = y_t - y_{t-1}$. Then, for the period 1953:1 to 1996:4, run
the regression

$$\Delta c_t = \beta_1 + \beta_2 \Delta y_t + \beta_3 \Delta y_{t-1} + u_t \tag{9.122}$$

by OLS, and test the hypothesis that the $u_t$ are serially uncorrelated against
the alternative that they follow an AR(1) process.


9.12 Calculate eight sets of HAC estimates of the standard errors of the OLS
parameter estimates from regression (9.122), using the Newey-West estimator
with the lag truncation parameter set to the values $p = 1, 2, 3, 4, 5, 6, 7, 8$.


9.13 Using the squares of $\Delta y_t$, $\Delta y_{t-1}$, and $\Delta c_{t-1}$ as additional instruments, obtain
feasible efficient GMM estimates of the parameters of (9.122) by minimizing
the criterion function (9.42), with $\hat\Sigma$ given by the HAC estimators computed
in the previous exercise. For $p = 6$, carry out the iterative procedure described
in Section 9.3 by which new parameter estimates are used to update the HAC
estimator, which is then used to update the parameter estimates. Warning: It
may be necessary to rescale the instruments so as to avoid numerical problems.


9.14 Suppose that $f_t = y_t - X_t\beta$. Show that, in this special case, the estimating
equations (9.77) yield the generalized IV estimator.


9.15 Starting from the asymptotic covariance matrix (9.67), show that, when
$\Omega^{-1} F_0$ is used in place of $Z$, the covariance matrix of the resulting estimator
is given by (9.83). Then show that, for the linear regression model
$y = X\beta + u$ with exogenous explanatory variables $X$, this estimator is the
GLS estimator.


9.16 The minimization of the GMM criterion function (9.87) yields the estimating
equations (9.89) with $A = \Psi^\top W$. Assuming that the $n \times l$ instrument matrix
$W$ satisfies the predeterminedness condition in the form (9.30), show that
these estimating equations are asymptotically equivalent to the equations

$$F_0^\top \Psi P_{\Psi^\top W} \Psi^\top f(\theta) = 0, \tag{9.123}$$

where, as usual, $F_0 \equiv F(\theta_0)$, with $\theta_0$ the true parameter vector. Next, derive
the asymptotic covariance matrix of the estimator defined by these equations.


Show that the equations (9.123) are the optimal estimating equations for
overidentified estimation based on the transformed zero functions $\Psi^\top f(\theta)$
and the transformed instruments $\Psi^\top W$. Show further that, if the condition
$\mathcal{S}(F) \subseteq \mathcal{S}(W)$ is satisfied, the asymptotic covariance matrix of the estimator
obtained by solving equations (9.123) coincides with the optimal asymptotic
covariance matrix (9.83).


9.17 Suppose the $n$-vector $f(\theta)$ of elementary zero functions has a covariance
matrix $\sigma^2 I$. Show that, if the instrumental variables used for GMM estimation
are the columns of the $n \times l$ matrix $W$, the GMM criterion function is

$$\frac{1}{\sigma^2}\, f^\top(\theta) P_W f(\theta). \tag{9.124}$$

Next, show that, whenever the instruments are predetermined, the artificial
regression

$$f(\theta) = -P_W F(\theta)\, b + \text{residuals}, \tag{9.125}$$

where $F(\theta)$ is defined as usual by (9.63), satisfies all the requisite properties
for hypothesis testing. These properties, which are spelt out in detail in 
Exercise 8.20 in the context of the IVGNR, are that the regressand should be 
orthogonal to the regressors when they are evaluated at the GMM estimator 
obtained by minimizing (9.124); that the OLS covariance matrix from (9.125) 
should be a consistent estimate of the asymptotic variance of that estimator; 
and that (9.125) should admit one-step estimation. 


9.18 Derive a heteroskedasticity robust version of the artificial regression (9.125),
assuming that the covariance matrix of the vector $f(\theta)$ of zero functions is
diagonal, but otherwise arbitrary.


Copyright © 1999, Russell Davidson and James G. MacKinnon 




9.19 If the scalar random variable $z$ is distributed according to the $N(\mu, \sigma^2)$
distribution, show that

$$E(e^z) = \exp\bigl(\mu + \tfrac{1}{2}\sigma^2\bigr).$$


9.20 Let the components $z_t$ of the $n$-vector $z$ be IID drawings from the $N(\mu, \sigma^2)$
distribution, and let $s^2$ be the OLS estimate of the error variance from the
regression of $z$ on the constant vector $\iota$. Show that the variance of $s^2$ is
$2\sigma^4/(n - 1)$.

Would this result still hold if the normality assumption were dropped? Without
this assumption, what would you need to know about the distribution of
the $z_t$ in order to find the variance of $s^2$?


9.21 Using the delta method, obtain an expression for the asymptotic variance of
the estimator defined by (9.101) for the variance of the normal distribution 
underlying a lognormal distribution. Show that this asymptotic variance is 
greater than that of the sample variance of the normal variables themselves. 


9.22 Describe the two procedures by which the parameters $\mu$ and $\sigma^2$ of the
lognormal distribution can be estimated by the method of simulated moments,
matching the first and second moments of the lognormal variable itself, and
the first moment of its log. The first procedure should use optimal instruments
and be just identified; the second should use the simple instruments
of (9.108) and be overidentified.


9.23 Show that minimizing the criterion function (9.113), when $W$ is defined in
(9.112), is equivalent to solving equations (9.111). Then show that it is also 
equivalent to minimizing the criterion function 


29T 
kee ) (9.126) 


2 
y —mo(p a | wow wy tw" [Fomine | 


y — mo(u, 0°) 


which is the criterion function for nonlinear IV estimation. 


9.24 The Singh-Maddala distribution is a three-parameter distribution which has
been shown to give an acceptable account, up to scale, of the distributions
of household income in many countries. It is characterized by the following
CDF:

$$F(y) = 1 - \frac{1}{(1 + a y^b)^c}, \quad y > 0,\ a > 0,\ b > 0,\ c > 0. \tag{9.127}$$


Suppose that you have at your disposal the values of the incomes of a random 
sample of households from a given population. Describe in detail how to use 
this sample in order to estimate the parameters a, b, and c of (9.127) by 
the method of simulated moments, basing the estimates on the expectations 
of $y$, $\log y$, and $y \log y$. Describe how to construct a consistent estimate of the
asymptotic covariance matrix of your estimator.


10.1 Introduction


The method of moments is not the only fundamental principle of estimation, 
even though the estimation methods for regression models discussed up to 
this point (ordinary, nonlinear, and generalized least squares, instrumental 
variables, and GMM) can all be derived from it. In this chapter, we introduce 
another fundamental method of estimation, namely, the method of maximum 
likelihood. For regression models, if we make the assumption that the error 
terms are normally distributed, the maximum likelihood, or ML, estimators 
coincide with the various least squares estimators with which we are already 
familiar. But maximum likelihood can also be applied to an extremely wide 
variety of models other than regression models, and it generally yields esti- 
mators with excellent asymptotic properties. The major disadvantage of ML 
estimation is that it requires stronger distributional assumptions than does 
the method of moments. 


In the next section, we introduce the basic ideas of maximum likelihood esti- 
mation and discuss a few simple examples. Then, in Section 10.3, we explore 
the asymptotic properties of ML estimators. Ways of estimating the covar- 
iance matrix of an ML estimator will be discussed in Section 10.4. Some 
methods of hypothesis testing that are available for models estimated by 
ML will be introduced in Section 10.5 and discussed more formally in Sec- 
tion 10.6. The remainder of the chapter discusses some useful applications 
of maximum likelihood estimation. Section 10.7 deals with regression models 
with autoregressive errors, and Section 10.8 deals with models that involve 
transformations of the dependent variable. 


10.2 Basic Concepts of Maximum Likelihood Estimation 


Models that are estimated by maximum likelihood must be fully specified 
parametric models, in the sense of Section 1.3. For such a model, once the 
parameter values are known, all necessary information is available to simulate 
the dependent variable(s). In Section 1.2, we introduced the concept of the
probability density function, or PDF, of a scalar random variable and of the
joint density function, or joint PDF, of a set of random variables. If we can 
simulate the dependent variable, this means that its PDF must be known, both 
for each observation as a scalar r.v., and for the full sample as a vector r.v. 


As usual, we denote the dependent variable by the $n$-vector $y$. For a given
$k$-vector $\theta$ of parameters, let the joint PDF of $y$ be written as $f(y, \theta)$. This
joint PDF constitutes the specification of the model. Since a PDF provides
an unambiguous recipe for simulation, it suffices to specify the vector $\theta$ in
order to give a full characterization of a DGP in the model. Thus there is a
one-to-one correspondence between the DGPs of the model and the admissible
parameter vectors.


Maximum likelihood estimation is based on the specification of the model
through the joint PDF $f(y, \theta)$. When $\theta$ is fixed, the function $f(\cdot, \theta)$ of $y$
is interpreted as the PDF of $y$. But if instead $f(y, \theta)$ is evaluated at the
$n$-vector $y$ found in a given data set, then the function $f(y, \cdot)$ of the model
parameters can no longer be interpreted as a PDF. Instead, it is referred to as
the likelihood function of the model for the given data set. ML estimation then
amounts to maximizing the likelihood function with respect to the parameters.
A parameter vector $\hat\theta$ at which the likelihood takes on its maximum value is
called a maximum likelihood estimate, or MLE, of the parameters.


In many cases, the successive observations in a sample are assumed to be 
statistically independent. In that case, the joint density of the entire sample 
is just the product of the densities of the individual observations. Let $f(y_t, \theta)$
denote the PDF of a typical observation, $y_t$. Then the joint density of the
entire sample $y$ is

$$f(y, \theta) = \prod_{t=1}^{n} f(y_t, \theta). \tag{10.01}$$


Because (10.01) is a product, it will often be a very large or very small number, 
perhaps so large or so small that it cannot easily be represented in a computer. 
For this and a number of other reasons, it is customary to work instead with 
the loglikelihood function

$$\ell(y, \theta) \equiv \log f(y, \theta) = \sum_{t=1}^{n} \ell_t(y_t, \theta), \tag{10.02}$$

where $\ell_t(y_t, \theta)$, the contribution to the loglikelihood function made by
observation $t$, is equal to $\log f_t(y_t, \theta)$. The $t$ subscripts on $f_t$ and $\ell_t$ have been added
to allow for the possibility that the density of $y_t$ may vary from observation
to observation, perhaps because there are exogenous variables in the model.
Whatever value of $\theta$ maximizes the loglikelihood function (10.02) will also
maximize the likelihood function (10.01), because $\ell(y, \theta)$ is just a monotonic
transformation of $f(y, \theta)$.

Figure 10.1 The exponential distribution

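The numerical motivation for working with (10.02) rather than (10.01) can be checked directly. The following Python sketch is an illustration added here, not part of the original text: it assumes IID $N(\theta, 1)$ observations, and the names `log_density`, `likelihood`, and `loglik` are hypothetical. With 2000 observations the product of densities underflows to zero in double precision, while the sum of log densities remains an ordinary finite number.

```python
import math
import random

random.seed(42)
n = 2000
theta = 0.0  # assumed mean parameter; the variance is fixed at 1 for this illustration
y = [random.gauss(theta, 1.0) for _ in range(n)]


def log_density(y_t, theta):
    """Log of the N(theta, 1) density evaluated at y_t."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y_t - theta) ** 2


# Likelihood as in (10.01): a product of n densities, each of which is below 0.4 here.
likelihood = 1.0
for y_t in y:
    likelihood *= math.exp(log_density(y_t, theta))

# Loglikelihood as in (10.02): a sum of n log densities.
loglik = sum(log_density(y_t, theta) for y_t in y)

print(likelihood)  # the product underflows in double precision
print(loglik)      # the sum stays finite
```

Because the logarithm is monotonic, nothing is lost by maximizing `loglik` instead of `likelihood`; the maximizing parameter value is the same.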

The Exponential Distribution 


As a simple example of ML estimation, suppose that each observation $y_t$ is
generated by the density

$$f(y_t, \theta) = \theta e^{-\theta y_t}, \quad y_t > 0,\ \theta > 0. \tag{10.03}$$


This is the PDF of what is called the exponential distribution.¹ This density
is shown in Figure 10.1 for three values of the parameter $\theta$, which is what we
wish to estimate. There are assumed to be $n$ independent observations from
which to calculate the loglikelihood function.


Taking the logarithm of the density (10.03), we find that the contribution to
the loglikelihood from observation $t$ is $\ell_t(y_t, \theta) = \log\theta - \theta y_t$. Therefore,

$$\ell(y, \theta) = \sum_{t=1}^{n} (\log\theta - \theta y_t) = n \log\theta - \theta \sum_{t=1}^{n} y_t. \tag{10.04}$$


To maximize this loglikelihood function with respect to the single unknown
parameter $\theta$, we differentiate it with respect to $\theta$ and set the derivative equal
to 0. The result is

$$\frac{n}{\theta} - \sum_{t=1}^{n} y_t = 0, \tag{10.05}$$

which can easily be solved to yield

$$\hat\theta = \frac{n}{\sum_{t=1}^{n} y_t}. \tag{10.06}$$


1 The exponential distribution is useful for analyzing dependent variables which 
must be positive, such as waiting times or the duration of unemployment. 
Models for duration data will be discussed in Section 11.8.

This solution is clearly unique, because the second derivative of (10.04), which
is the first derivative of the left-hand side of (10.05), is always negative, which
implies that the first derivative can vanish at most once. Since it is unique, the
estimator $\hat\theta$ defined in (10.06) can be called the maximum likelihood estimator
that corresponds to the loglikelihood function (10.04).


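In this case, interestingly, the ML estimator $\hat\theta$ is the same as a method of
moments estimator. As we now show, the expected value of $y_t$ is $1/\theta$. By
definition, this expectation is

$$E(y_t) = \int_0^\infty y_t\, \theta e^{-\theta y_t}\, dy_t.$$

Since $-\theta e^{-\theta y_t}$ is the derivative of $e^{-\theta y_t}$ with respect to $y_t$, we may integrate
by parts to obtain

$$\int_0^\infty y_t\, \theta e^{-\theta y_t}\, dy_t = -\Bigl[y_t e^{-\theta y_t}\Bigr]_0^\infty + \int_0^\infty e^{-\theta y_t}\, dy_t = \Bigl[-\theta^{-1} e^{-\theta y_t}\Bigr]_0^\infty = \theta^{-1}.$$

The most natural MM estimator of $\theta$ is the one that matches $\theta^{-1}$ to the
empirical analog of $E(y_t)$, which is $\bar y$, the sample mean. This estimator of $\theta$
is therefore $1/\bar y$, which is identical to the ML estimator (10.06).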
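The coincidence of the ML and MM estimators is easy to confirm numerically. The sketch below is an illustration added here under stated assumptions: the true value `theta0 = 0.5`, the sample size, and the function name `loglik` are all hypothetical choices, and the data are simulated with `random.expovariate`, which draws from the density (10.03).

```python
import math
import random


def loglik(theta, y):
    """Exponential loglikelihood (10.04): n*log(theta) - theta*sum(y)."""
    return len(y) * math.log(theta) - theta * sum(y)


random.seed(12345)
theta0 = 0.5  # assumed true parameter for the simulated sample
y = [random.expovariate(theta0) for _ in range(10_000)]

theta_ml = len(y) / sum(y)   # ML estimator (10.06)
ybar = sum(y) / len(y)
theta_mm = 1.0 / ybar        # MM estimator: match 1/theta to the sample mean

print(theta_ml, theta_mm)
# The loglikelihood is lower on either side of the ML estimate.
print(loglik(theta_ml, y) > loglik(0.9 * theta_ml, y))
print(loglik(theta_ml, y) > loglik(1.1 * theta_ml, y))
```

The two estimates agree up to rounding error, and evaluating (10.04) a little to either side of the estimate gives a smaller value, as the negative second derivative implies.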


It is not uncommon for an ML estimator to coincide with an MM estimator, as 
happens in this case. This may suggest that maximum likelihood is not a very 
useful addition to the econometrician’s toolkit, but such an inference would 
be unwarranted. Even in this simple case, the ML estimator was considerably 
easier to obtain than the MM estimator, because we did not need to calculate 
an expectation. In more complicated cases, this advantage of ML estimation 
is often much more substantial. Moreover, as we will see in the next three 
sections, the fact that an estimator is an MLE generally ensures that it has 
a number of desirable asymptotic properties and makes it easy to calculate 
standard errors and test statistics.²


Regression Models with Normal Errors 


It is interesting to see what happens when we apply the method of maximum 
likelihood to the classical normal linear model 


$$y = X\beta + u, \quad u \sim N(0, \sigma^2 I), \tag{10.07}$$


which was introduced in Section 3.1. For this model, the explanatory variables
in the matrix $X$ are assumed to be exogenous. Consequently, in constructing
the likelihood function, we may use the density of $y$ conditional on $X$. The
elements $u_t$ of the vector $u$ are independently distributed as $N(0, \sigma^2)$, and so
$y_t$ is distributed, conditionally on $X$, as $N(X_t\beta, \sigma^2)$. Thus the PDF of $y_t$ is,
from (4.10),

$$f_t(y_t, \beta, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Bigl(-\frac{(y_t - X_t\beta)^2}{2\sigma^2}\Bigr). \tag{10.08}$$

2 Notice that the abbreviation “MLE” here means “maximum likelihood
estimator” rather than “maximum likelihood estimate.” We will use “MLE” to
mean either of these. Which of them it refers to in any given situation should
generally be obvious from the context; see Section 1.5.


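The contribution to the loglikelihood function made by the $t$th observation is
the logarithm of (10.08). Since $\log\sigma = \frac{1}{2}\log\sigma^2$, this can be written as

$$\ell_t(y_t, \beta, \sigma) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y_t - X_t\beta)^2. \tag{10.09}$$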


Since the observations are assumed to be independent, the loglikelihood function
is just the sum of these contributions over all $t$, or

$$\begin{aligned}
\ell(y, \beta, \sigma) &= -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{n}(y_t - X_t\beta)^2 \\
&= -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta).
\end{aligned} \tag{10.10}$$

In the second line, we rewrite the sum of squared residuals as the inner product
of the residual vector with itself. To find the ML estimator, we need to
maximize (10.10) with respect to the unknown parameters $\beta$ and $\sigma$.


The first step in maximizing $\ell(y, \beta, \sigma)$ is to concentrate it with respect to the
parameter $\sigma$. This means differentiating (10.10) with respect to $\sigma$, solving
the resulting first-order condition for $\sigma$ as a function of the data and the
remaining parameters, and then substituting the result back into (10.10).
The concentrated loglikelihood function that results will then be maximized
with respect to $\beta$. For models that involve variance parameters, it is very
often convenient to concentrate the loglikelihood function in this way.


Differentiating the second line of (10.10) with respect to $\sigma$ and equating the
derivative to zero yields the first-order condition

$$\frac{\partial \ell(y, \beta, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}(y - X\beta)^\top (y - X\beta) = 0,$$

and solving this yields the result that

$$\hat\sigma^2(\beta) = \frac{1}{n}(y - X\beta)^\top (y - X\beta).$$

Here the notation $\hat\sigma^2(\beta)$ indicates that the value of $\sigma^2$ that maximizes (10.10)
depends on $\beta$.

Substituting $\hat\sigma^2(\beta)$ into the second line of (10.10) yields the concentrated
loglikelihood function

$$\ell^c(y, \beta) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\Bigl(\frac{1}{n}(y - X\beta)^\top (y - X\beta)\Bigr) - \frac{n}{2}. \tag{10.11}$$

The middle term here is minus $n/2$ times the logarithm of the sum of squared
residuals divided by $n$, and the other two terms do not depend on $\beta$. Thus we see that
maximizing the concentrated loglikelihood function (10.11) is equivalent to
minimizing the sum of squared residuals as a function of $\beta$. Therefore, the
ML estimator $\hat\beta$ must be identical to the OLS estimator.


Once $\hat\beta$ has been found, the ML estimate $\hat\sigma^2$ of $\sigma^2$ is $\hat\sigma^2(\hat\beta)$, and the MLE of $\sigma$
is the positive square root of $\hat\sigma^2$. Thus, as we saw in Section 3.6, the MLE $\hat\sigma^2$ is
biased downward.³ The actual maximized value of the loglikelihood function
can then be written in terms of the sum-of-squared-residuals function SSR
evaluated at $\hat\beta$. From (10.11) we have

$$\ell(y, \hat\beta, \hat\sigma) = -\frac{n}{2}(1 + \log 2\pi - \log n) - \frac{n}{2}\log \mathrm{SSR}(\hat\beta), \tag{10.12}$$

where $\mathrm{SSR}(\hat\beta)$ denotes the minimized sum of squared residuals.

Although it is convenient to concentrate (10.10) with respect to $\sigma$, as we have
done, this is not the only way to proceed. In Exercise 10.1, readers are asked
to show that the ML estimators of $\beta$ and $\sigma$ can be obtained equally well by
concentrating the loglikelihood with respect to $\beta$ rather than $\sigma$.
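The results of this subsection can be verified on simulated data. The following Python sketch is an added illustration, not the authors' code: it assumes a regression on a constant and one regressor, with hypothetical true values (intercept 1, slope 0.5, error standard deviation 2) and hypothetical function names `loglik` and `ols`. It evaluates (10.10) at the OLS estimates with $\hat\sigma^2 = \mathrm{SSR}/n$ and compares the result with the closed-form expression (10.12).

```python
import math
import random


def loglik(y, x, b1, b2, sigma2):
    """Normal loglikelihood (10.10) for y_t = b1 + b2*x_t + u_t."""
    n = len(y)
    ssr = sum((yt - b1 - b2 * xt) ** 2 for yt, xt in zip(y, x))
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ssr / (2.0 * sigma2))


def ols(y, x):
    """OLS intercept, slope, and SSR for a regression on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    b2 = (sum((xt - xbar) * (yt - ybar) for yt, xt in zip(y, x))
          / sum((xt - xbar) ** 2 for xt in x))
    b1 = ybar - b2 * xbar
    ssr = sum((yt - b1 - b2 * xt) ** 2 for yt, xt in zip(y, x))
    return b1, b2, ssr


random.seed(2024)
n = 200
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 0.5 * xt + random.gauss(0.0, 2.0) for xt in x]

b1_hat, b2_hat, ssr = ols(y, x)
sigma2_hat = ssr / n  # ML estimate of sigma^2: SSR divided by n, not n - k
max_direct = loglik(y, x, b1_hat, b2_hat, sigma2_hat)
max_formula = (-0.5 * n * (1.0 + math.log(2.0 * math.pi) - math.log(n))
               - 0.5 * n * math.log(ssr))  # equation (10.12)
print(max_direct, max_formula)
```

The two printed values should agree up to rounding error, and moving any of the three parameters away from the estimates lowers the loglikelihood.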


The fact that the ML and OLS estimators of $\beta$ are identical depends critically
on the assumption that the error terms in (10.07) are normally distributed. If 
we had started with a different assumption about their distribution, we would 
have obtained a different ML estimator. The asymptotic efficiency result to 
be discussed in Section 10.4 would then imply that the least squares estimator 
is asymptotically less efficient than the ML estimator whenever the two do 
not coincide. 


The Uniform Distribution 


As a final example of ML estimation, we consider a somewhat pathological,
but rather interesting, example. Suppose that the $y_t$ are generated as independent
realizations from the uniform distribution with parameters $\beta_1$ and $\beta_2$,
which can be written as a vector $\beta$; a special case of this distribution was
introduced in Section 1.2. The density function for $y_t$, which is graphed in
Figure 10.2, is

$$f(y_t, \beta) = \begin{cases} 0 & \text{if } y_t < \beta_1, \\[4pt] \dfrac{1}{\beta_2 - \beta_1} & \text{if } \beta_1 \le y_t \le \beta_2, \\[4pt] 0 & \text{if } y_t > \beta_2. \end{cases}$$

Figure 10.2 The uniform distribution

3 The bias arises because we evaluate $\mathrm{SSR}(\beta)$ at $\hat\beta$ instead of at the true value $\beta_0$.
However, if one thinks of $\hat\sigma$ as an estimator of $\sigma$, rather than of $\hat\sigma^2$ as an
estimator of $\sigma^2$, then it can be shown that both the OLS and the ML estimators
are biased downward.


Provided that $\beta_1 \le y_t \le \beta_2$ for all observations, the likelihood function is
equal to $1/(\beta_2 - \beta_1)^n$, and the loglikelihood function is therefore

$$\ell(y, \beta) = -n \log(\beta_2 - \beta_1).$$


It is easy to verify that this function cannot be maximized by differentiating
it with respect to the parameters and setting the partial derivatives to zero.
Instead, the way to maximize $\ell(y, \beta)$ is to make $\beta_2 - \beta_1$ as small as possible.
But we clearly cannot make $\beta_1$ larger than the smallest observed $y_t$, and we
cannot make $\beta_2$ smaller than the largest observed $y_t$. Otherwise, the likelihood
function would be equal to 0. It follows that the ML estimators are

$$\hat\beta_1 = \min(y_t) \quad \text{and} \quad \hat\beta_2 = \max(y_t). \tag{10.13}$$


These estimators are rather unusual. For one thing, they will always lie on
one side of the true value. Because all the $y_t$ must lie between $\beta_1$ and $\beta_2$,
it must be the case that $\hat\beta_1 \ge \beta_{10}$ and $\hat\beta_2 \le \beta_{20}$, where $\beta_{10}$ and $\beta_{20}$ denote
the true parameter values. However, despite this, these estimators turn out
to be consistent. Intuitively, this is because, as the sample size gets large, the
observed values of $y_t$ fill up the entire space between $\beta_{10}$ and $\beta_{20}$.


The ML estimators defined in (10.13) are super-consistent, which means that
they approach the true values of the parameters they are estimating at a
rate faster than the usual rate of $n^{-1/2}$. Formally, $n^{1/2}(\hat\beta_1 - \beta_{10})$ tends to
zero as $n \to \infty$, while $n(\hat\beta_1 - \beta_{10})$ tends to a limiting random variable; see
Exercise 10.2 for more details. Now consider the parameter $\gamma \equiv \frac{1}{2}(\beta_1 + \beta_2)$.
One way to estimate it is to use the ML estimator

$$\hat\gamma = \tfrac{1}{2}(\hat\beta_1 + \hat\beta_2).$$

Another approach would simply be to use the sample mean, say $\bar\gamma$, which is
a least squares estimator. But the ML estimator $\hat\gamma$ will be super-consistent,
while $\bar\gamma$ will only be root-$n$ consistent. This implies that, except perhaps
for very small sample sizes, the ML estimator will be very much more efficient
than the least squares estimator. In Exercise 10.3, readers are asked to
perform a simulation experiment to illustrate this result.
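A small Monte Carlo experiment in the spirit of that exercise might look as follows. This is only a sketch under assumptions made here: the true parameters are taken to be $\beta_1 = 0$ and $\beta_2 = 1$, the number of replications is arbitrary, and the function name `simulate` is hypothetical. It compares the mean squared errors of $\hat\gamma$ (the midpoint of the sample minimum and maximum) and $\bar\gamma$ (the sample mean).

```python
import random


def simulate(n, reps, beta1=0.0, beta2=1.0, seed=0):
    """Return the mean squared errors of the ML (midrange) and sample-mean estimators of gamma."""
    rng = random.Random(seed)
    gamma = 0.5 * (beta1 + beta2)
    se_ml = se_mean = 0.0
    for _ in range(reps):
        y = [rng.uniform(beta1, beta2) for _ in range(n)]
        gamma_hat = 0.5 * (min(y) + max(y))  # ML estimator based on (10.13)
        gamma_bar = sum(y) / n               # sample mean (least squares)
        se_ml += (gamma_hat - gamma) ** 2
        se_mean += (gamma_bar - gamma) ** 2
    return se_ml / reps, se_mean / reps


for n in (10, 100, 1000):
    mse_ml, mse_mean = simulate(n, reps=500)
    print(n, mse_ml, mse_mean, mse_mean / mse_ml)
```

If the super-consistency argument is right, the ratio in the last column should grow roughly in proportion to $n$, since the mean squared error of $\bar\gamma$ shrinks like $1/n$ while that of $\hat\gamma$ shrinks like $1/n^2$.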


Although economists rarely need to estimate the parameters of a uniform 
distribution directly, ML estimators with properties similar to those of (10.13) 
do occur from time to time. In particular, certain econometric models of 
auctions lead to super-consistent ML estimators; see Donald and Paarsch 
(1993, 1996). However, because these estimators violate standard regularity 
conditions, such as those given in Theorems 8.2 and 8.3 of Davidson and 
MacKinnon (1993), we will not consider them further. 


Two Types of ML Estimator 


There are two different ways of defining the ML estimator, although most
MLEs actually satisfy both definitions. A Type 1 ML estimator maximizes
the loglikelihood function over the set $\Theta$, where $\Theta$ denotes the parameter
space in which the parameter vector $\theta$ lies, which is generally assumed to be
a subset of $\mathbb{R}^k$. This is the natural meaning of an MLE, and all three of the
ML estimators just discussed are Type 1 estimators.


If the loglikelihood function is differentiable and attains an interior maximum
in the parameter space, then the MLE must satisfy the first-order conditions
for a maximum. A Type 2 ML estimator is defined as a solution to the
likelihood equations, which are just the following first-order conditions:

$$g(y, \hat\theta) = 0, \tag{10.14}$$

where $g(y, \theta)$ is the gradient vector, or score vector, which has typical element

$$g_i(y, \theta) \equiv \frac{\partial \ell(y, \theta)}{\partial \theta_i} = \sum_{t=1}^{n} \frac{\partial \ell_t(y_t, \theta)}{\partial \theta_i}. \tag{10.15}$$


Because there may be more than one value of $\theta$ that satisfies the likelihood
equations (10.14), the definition further requires that the Type 2 estimator $\hat\theta$
be associated with a local maximum of $\ell(y, \theta)$ and that, as $n \to \infty$, the
value of the loglikelihood function associated with $\hat\theta$ be higher than the value
associated with any other root of the likelihood equations.


The ML estimator (10.06) for the parameter of the exponential distribution
and the OLS estimators of $\beta$ and $\sigma^2$ in the regression model with normal
errors, like most ML estimators, are both Type 1 and Type 2 MLEs. However,
the MLEs for the parameters of the uniform distribution defined in (10.13)
are Type 1 but not Type 2 MLEs, because they are not the solutions to any
set of likelihood equations. In rare circumstances, there also exist MLEs that
are Type 2 but not Type 1; see Kiefer (1978) for an example.


Computing ML Estimates


Maximum likelihood estimates are often quite easy to compute. Indeed, for 
the three examples considered above, we were able to obtain explicit expres- 
sions. When no such expressions are available, as will often be the case, it is 
necessary to use some sort of nonlinear maximization procedure. Many such 
procedures are readily available. 


The discussion of Newton’s Method and quasi-Newton methods in Section 6.4
applies with very minor changes to ML estimation. Instead of minimizing
the sum of squared residuals function $Q(\beta)$, we maximize the loglikelihood
function $\ell(\theta)$. Since the maximization is done with respect to $\theta$ for a given
sample $y$, we suppress the explicit dependence of $\ell$ on $y$. As in the NLS case,
Newton’s Method makes use of the Hessian, which is now a $k \times k$ matrix $H(\theta)$
with typical element $\partial^2 \ell(\theta)/\partial\theta_i \partial\theta_j$. The Hessian is the matrix of second
derivatives of the loglikelihood function, and thus also the matrix of first
derivatives of the gradient.


Let $\theta_{(j)}$ denote the value of the vector of estimates at step $j$ of the algorithm,
and let $g_{(j)}$ and $H_{(j)}$ denote, respectively, the gradient and the Hessian evaluated
at $\theta_{(j)}$. Then the fundamental equation for Newton’s Method is

$$\theta_{(j+1)} = \theta_{(j)} - H_{(j)}^{-1} g_{(j)}. \tag{10.16}$$

This may be obtained in exactly the same way as equation (6.42). Because
the loglikelihood function is to be maximized, the Hessian should be negative
definite, at least when $\theta_{(j)}$ is sufficiently near $\hat\theta$. This ensures that the step
defined by (10.16) will be in an uphill direction.
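For a concrete case, Newton's Method can be applied to the exponential loglikelihood (10.04), where the answer (10.06) is known and can be used as a check. The sketch below is an added illustration, not the authors' code: it handles only a scalar parameter, and the function name `newton`, the starting value 0.1, the tolerance, and the simulated sample are all assumptions made here.

```python
import random


def newton(gradient, hessian, theta, tol=1e-12, max_iter=100):
    """Iterate theta_(j+1) = theta_(j) - g_(j)/H_(j), the scalar case of (10.16)."""
    for _ in range(max_iter):
        step = -gradient(theta) / hessian(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta


random.seed(7)
y = [random.expovariate(2.0) for _ in range(1000)]  # assumed true theta of 2
n, total = len(y), sum(y)


def gradient(theta):
    """First derivative of (10.04) with respect to theta."""
    return n / theta - total


def hessian(theta):
    """Second derivative of (10.04); negative for every theta > 0."""
    return -n / theta ** 2


theta_newton = newton(gradient, hessian, theta=0.1)
theta_closed = n / total  # closed-form ML estimator (10.06)
print(theta_newton, theta_closed)
```

In this example each Newton step simplifies to $\theta_{(j+1)} = \theta_{(j)}(2 - \theta_{(j)}\bar y)$, so the iteration converges only from starting values between 0 and $2/\bar y$. A starting value above $2/\bar y$ produces a negative iterate, which illustrates why a poor starting point can defeat the method even when the Hessian is negative everywhere.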


For the reasons discussed in Section 6.4, Newton’s Method will usually not
work well, and will often not work at all, when the Hessian is not negative
definite. In such cases, one popular way to obtain the MLE is to use some
sort of quasi-Newton method, in which (10.16) is replaced by the formula

$$\theta_{(j+1)} = \theta_{(j)} + \alpha_{(j)} D_{(j)}^{-1} g_{(j)},$$

where $\alpha_{(j)}$ is a scalar which is determined at each step, and $D_{(j)}$ is a matrix
which approximates $-H_{(j)}$ near the maximum but is constructed so that it
is always positive definite. Sometimes, as in the case of NLS estimation, an
artificial regression can be used to compute the vector $D_{(j)}^{-1} g_{(j)}$. We will
encounter one such artificial regression in Section 10.4, and another, more
specialized, one in Section 11.3.


When the loglikelihood function is globally concave and not too flat, maxi- 
mizing it is usually quite easy. At the other extreme, when the loglikelihood 
function has several local maxima, doing so can be very difficult. See the 
discussion in Section 6.4 following Figure 6.3. Everything that is said there 
about dealing with multiple minima in NLS estimation applies, with certain 
obvious modifications, to the problem of dealing with multiple maxima in ML
estimation.


10.3 Asymptotic Properties of ML Estimators


One of the attractive features of maximum likelihood estimation is that ML 
estimators are consistent under quite weak regularity conditions and asymp- 
totically normally distributed under somewhat stronger conditions. Therefore, 
if an estimator is an ML estimator and the regularity conditions are satisfied, 
it is not necessary to show that it is consistent or derive its asymptotic dis- 
tribution. In this section, we sketch derivations of the principal asymptotic 
properties of ML estimators. A rigorous discussion is beyond the scope of this 
book; interested readers may consult, among other references, Davidson and 
MacKinnon (1993, Chapter 8) and Newey and McFadden (1994). 


Consistency of the MLE 


Since almost all maximum likelihood estimators are of Type 1, we will discuss 
consistency only for this type of MLE. We first show that the expectation of 
the loglikelihood function is greater when it is evaluated at the true values of 
the parameters than when it is evaluated at any other values. For consistency, 
we also need both a finite-sample identification condition and an asymptotic 
identification condition. The former requires that the loglikelihood be different 
for different sets of parameter values. If, contrary to this assumption, there 
were two distinct parameter vectors, $\theta_1$ and $\theta_2$, such that $\ell(y, \theta_1) = \ell(y, \theta_2)$
for all $y$, then it would obviously be impossible to distinguish between $\theta_1$
and $\theta_2$. Thus a finite-sample identification condition is necessary for the
model to make sense. The role of the asymptotic identification condition will 
be discussed below. 


Let $L(\theta) = \exp(\ell(\theta))$ denote the likelihood function, where the
dependence on $y$ of both $L$ and $\ell$ has been suppressed for notational
simplicity. We wish to apply a result known as Jensen’s Inequality to the ratio
$L(\theta^*)/L(\theta_0)$, where $\theta_0$ is the true parameter vector and
$\theta^*$ is any other vector in the parameter space of the model. Jensen’s
Inequality tells us that, if $X$ is a real-valued random variable, then
$E(h(X)) \le h(E(X))$ whenever $h(\cdot)$ is a concave function. The
inequality will be strict whenever $h$ is strictly concave over at least part of
the support of the random variable $X$, that is, the set of real numbers for
which the density of $X$ is nonzero, and the support contains more than one point.
See Exercise 10.4 for the proof of a restricted version of Jensen’s Inequality. 
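
As a quick numerical illustration of Jensen’s Inequality with $h = \log$, the
following sketch evaluates both sides for a small discrete random variable. The
values and probabilities are hypothetical and chosen only for illustration;
nothing here comes from the text beyond the inequality itself.

```python
import math

# Hypothetical discrete random variable X (illustrative values only).
values = [0.5, 1.0, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expectation(func, values, probs):
    """Return E[func(X)] for a discrete random variable."""
    return sum(p * func(x) for x, p in zip(values, probs))

# Left-hand side: E(h(X)) with h = log, which is strictly concave.
mean_of_log = expectation(math.log, values, probs)
# Right-hand side: h(E(X)).
log_of_mean = math.log(expectation(lambda x: x, values, probs))

print("E(log X) =", mean_of_log)
print("log E(X) =", log_of_mean)
```

Because the support has more than one point and the logarithm is strictly
concave, the first number should be strictly smaller than the second.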


Since the logarithm is a strictly concave function over the nonnegative real
line, and since likelihood functions are nonnegative, we can conclude from
Jensen’s Inequality that

$$E_0 \log\left(\frac{L(\theta^*)}{L(\theta_0)}\right) < \log E_0\left(\frac{L(\theta^*)}{L(\theta_0)}\right), \tag{10.17}$$

with strict inequality for all $\theta^* \ne \theta_0$, on account of the
finite-sample identification condition. Here the notation $E_0$ means the
expectation taken under the DGP characterized by the true parameter vector
$\theta_0$. Since the joint density of the sample is simply the likelihood
function evaluated at $\theta_0$, the expectation on the right-hand side of
(10.17) can be expressed as an integral over the support of the vector random
variable $y$. We have

$$E_0\left(\frac{L(\theta^*)}{L(\theta_0)}\right) = \int \frac{L(\theta^*)}{L(\theta_0)}\, L(\theta_0)\, dy = \int L(\theta^*)\, dy = 1,$$

where the last equality here holds because every density must integrate to 1.
Therefore, because log 1 = 0, the inequality (10.17) implies that

$$E_0\, \ell(\theta^*) - E_0\, \ell(\theta_0) < 0. \tag{10.18}$$

In words, (10.18) says that the expectation of the loglikelihood function when
evaluated at the true parameter vector, $\theta_0$, is strictly greater than its
expectation when evaluated at any other parameter vector, $\theta^*$.
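
The inequality can be checked in closed form for a simple model. The sketch
below assumes, purely for illustration, an exponential density
$\theta e^{-\theta y}$, for which one contribution to the loglikelihood is
$\log\theta - \theta y$ and $E_0(y) = 1/\theta_0$. The model, the function name,
and the grid are assumptions of this example, not part of the text.

```python
import math

def expected_loglik(theta, theta0):
    """E_0 of one contribution log(theta) - theta*y, using E_0(y) = 1/theta0."""
    return math.log(theta) - theta / theta0

theta0 = 2.0
# Grid of candidate parameter values 0.5, 0.75, ..., 4.0 (contains theta0).
grid = [0.5 + 0.25 * i for i in range(15)]

# The expected loglikelihood should be largest at the true value.
best = max(grid, key=lambda th: expected_loglik(th, theta0))
print("maximizer on the grid:", best)
```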


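If we can apply a law of large numbers to the contributions to the loglikelihood
function, then we can assert that
$\operatorname{plim} n^{-1}\ell(\theta) = \lim n^{-1}E_0\,\ell(\theta)$. Then
(10.18) implies that

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\theta^*) \le \operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\theta_0), \tag{10.19}$$

for all $\theta^* \ne \theta_0$, where the inequality is not necessarily strict,
because we have taken a limit. Since the MLE $\hat\theta$ maximizes
$\ell(\theta)$, it must be the case that

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\hat\theta) \ge \operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\theta_0). \tag{10.20}$$

The only way that (10.19) and (10.20) can both be true is if

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\hat\theta) = \operatorname*{plim}_{n\to\infty} \frac{1}{n}\,\ell(\theta_0). \tag{10.21}$$

In words, (10.21) says that the plim of 1/n times the loglikelihood function
must be the same when it is evaluated at the MLE $\hat\theta$ as when it is
evaluated at the true parameter vector $\theta_0$.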


By itself, the result (10.21) does not prove that $\hat\theta$ is consistent,
because the weak inequality does not rule out the possibility that there may be
many values $\theta^*$ for which
$\operatorname{plim} n^{-1}\ell(\theta^*) = \operatorname{plim} n^{-1}\ell(\theta_0)$.
We must therefore explicitly assume that
$\operatorname{plim} n^{-1}\ell(\theta^*) \ne \operatorname{plim} n^{-1}\ell(\theta_0)$
for all $\theta^* \ne \theta_0$. This is a form of asymptotic identification
condition; see Section 6.2. More primitive regularity conditions on the model
and the DGP can be invoked to ensure that the MLE is asymptotically identified.
For example, we need to rule out pathological cases like (3.20), in which each
new observation adds less and less information about one or more of the
parameters.
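
Consistency can be illustrated by simulation. The following sketch again
assumes an exponential model with rate $\theta_0$, whose MLE is the reciprocal
of the sample mean; the model, the seed, and the sample sizes are illustrative
choices and not taken from the text.

```python
import random

def exponential_mle(sample):
    """MLE of the rate theta for the density theta*exp(-theta*y): 1 / sample mean."""
    return len(sample) / sum(sample)

random.seed(12345)   # fixed seed so that the run is reproducible
theta0 = 2.0         # true parameter of the simulated DGP

estimates = {}
for n in (100, 10_000, 100_000):
    # random.expovariate takes the rate parameter, here theta0.
    sample = [random.expovariate(theta0) for _ in range(n)]
    estimates[n] = exponential_mle(sample)
    print(n, estimates[n])
```

With the largest sample, the estimate should lie very close to the true value
of 2, as consistency suggests.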


Dependent Observations

Before we can discuss the asymptotic normality of the MLE, we need to 
introduce some notation and terminology, and we need to establish a few 
preliminary results. First, we consider the structure of the likelihood and 
loglikelihood functions for models in which the successive observations are not 
independent, as is the case, for instance, when a regression function involves 
lags of the dependent variable. 


Recall the definition (1.15) of the density of one random variable conditional
on another. This definition can be rewritten so as to take the form of a
factorization of the joint density:

$$f(y_1, y_2) = f(y_1)\, f(y_2 \mid y_1), \tag{10.22}$$

where we use $y_1$ and $y_2$ in place of the variables $x_2$ and $x_1$,
respectively, that appear in (1.15). It is permissible to apply (10.22) to
situations in which $y_1$ and $y_2$ are really vectors of random variables.
Accordingly, consider the joint density of three random variables, and group
the first two together. Analogously to (10.22), we have

$$f(y_1, y_2, y_3) = f(y_1, y_2)\, f(y_3 \mid y_1, y_2). \tag{10.23}$$

Substituting (10.22) into (10.23) yields the following factorization of the
joint density:

$$f(y_1, y_2, y_3) = f(y_1)\, f(y_2 \mid y_1)\, f(y_3 \mid y_1, y_2).$$

For a sample of size n, it is easy to see that this last result generalizes to

$$f(y_1, \ldots, y_n) = f(y_1)\, f(y_2 \mid y_1) \cdots f(y_n \mid y_1, \ldots, y_{n-1}).$$

This result can be written using a somewhat more convenient notation as
follows:

$$f(y^n) = \prod_{t=1}^{n} f(y_t \mid y^{t-1}),$$

where the vector $y^t$ is a t-vector with components $y_1, y_2, \ldots, y_t$.
One can think of $y^t$ as the subsample consisting of the first t observations
of the full sample. For a model to be estimated by maximum likelihood, the
density $f(y^n)$ will depend on a k-vector of parameters $\theta$, and we can
then write

$$f(y^n, \theta) = \prod_{t=1}^{n} f(y_t \mid y^{t-1};\, \theta). \tag{10.24}$$

The structure of (10.24) is a straightforward generalization of that of (10.01),
where the marginal densities of the successive observations are replaced by
densities conditional on the preceding observations.


The loglikelihood function corresponding to (10.24) has an additive structure:

$$\ell(y, \theta) = \sum_{t=1}^{n} \ell_t(y^t, \theta), \tag{10.25}$$

where we omit the superscript n from $y$ for the full sample. In addition, in
the contributions $\ell_t(\cdot)$ to the loglikelihood, we do not distinguish
between the current variable $y_t$ and the lagged variables in the vector
$y^{t-1}$. In this way, (10.25) has exactly the same structure as (10.02).
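
To make the factorization concrete, here is a minimal sketch for a stationary
AR(1) process with normal errors, $y_t = \rho y_{t-1} + u_t$. The model and all
function names are assumptions introduced for this example. The first
contribution uses the marginal density of $y_1$, and each later one uses the
density of $y_t$ conditional on $y_{t-1}$. For two observations, the sum of the
contributions can be compared with the log of the bivariate normal joint
density, as in (10.22).

```python
import math

def log_normal_pdf(x, mean, var):
    """Log of the N(mean, var) density at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def ar1_loglik_contributions(y, rho, sigma):
    """Contributions l_t for a stationary AR(1): marginal for y_1, conditional after."""
    var_marginal = sigma ** 2 / (1 - rho ** 2)
    contribs = [log_normal_pdf(y[0], 0.0, var_marginal)]
    for t in range(1, len(y)):
        contribs.append(log_normal_pdf(y[t], rho * y[t - 1], sigma ** 2))
    return contribs

def bivariate_normal_logpdf(y1, y2, var, cov):
    """Log joint density of a zero-mean bivariate normal with equal variances."""
    det = var * var - cov * cov
    quad = (var * y1 * y1 - 2 * cov * y1 * y2 + var * y2 * y2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

rho, sigma = 0.5, 1.2
y = [0.3, -0.7, 1.1]
contribs = ar1_loglik_contributions(y, rho, sigma)
var = sigma ** 2 / (1 - rho ** 2)
print(sum(contribs[:2]), bivariate_normal_logpdf(y[0], y[1], var, rho * var))
```

The two printed numbers should agree up to rounding error, since the joint
density of the first two observations is the product of the marginal and the
conditional density.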


The Gradient 


The gradient, or score, vector $g(y, \theta)$ is a k-vector that was defined in
(10.15). As that equation makes clear, each component of the gradient vector is
itself a sum of n contributions, and this remains true when the observations
are dependent; the partial derivative of $\ell_t$ with respect to $\theta_i$ now
depends on $y^t$ rather than just $y_t$. It is convenient to group these partial
derivatives into a matrix. We define the $n \times k$ matrix $G(y, \theta)$ so
as to have typical element

$$G_{ti}(y^t, \theta) \equiv \frac{\partial \ell_t(y^t, \theta)}{\partial \theta_i}. \tag{10.26}$$

This matrix is called the matrix of contributions to the gradient, because

$$g_i(y, \theta) = \sum_{t=1}^{n} G_{ti}(y^t, \theta). \tag{10.27}$$

Thus each element of the gradient vector is the sum of the elements of one of
the columns of the matrix $G(y, \theta)$.


A crucial property of the matrix $G(y, \theta)$ is that, if $y$ is generated by
the DGP characterized by $\theta$, then the expectations of all the elements of
the matrix, evaluated at $\theta$, are zero. This result is a consequence of the
fact that all densities integrate to 1. Since $\ell_t$ is the log of the density
of $y_t$ conditional on $y^{t-1}$, we see that, for all t and for all $\theta$,

$$\int \exp\bigl(\ell_t(y^t, \theta)\bigr)\, dy_t = \int f_t(y^t, \theta)\, dy_t = 1,$$

where the integral is over the support of $y_t$. Since this relation holds
identically in $\theta$, we can differentiate it with respect to the components
of $\theta$ and obtain a further set of identities. Under weak regularity
conditions, it can be shown that the derivatives of the integral on the
left-hand side are the integrals of the derivatives of the integrand. Thus,
since the derivative of the constant 1 is 0, we have, identically in $\theta$
and for $i = 1, \ldots, k$,

$$\int \exp\bigl(\ell_t(y^t, \theta)\bigr)\, \frac{\partial \ell_t(y^t, \theta)}{\partial \theta_i}\, dy_t = 0. \tag{10.28}$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 



Since $\exp(\ell_t(y^t, \theta))$ is, for the DGP characterized by $\theta$, the
density of $y_t$ conditional on $y^{t-1}$, this last equation, along with the
definition (10.26), gives

$$E_\theta\bigl(G_{ti}(y^t, \theta) \mid y^{t-1}\bigr) = 0 \tag{10.29}$$

for all $t = 1, \ldots, n$ and $i = 1, \ldots, k$. The notation “$E_\theta$” here
means that the expectation is being taken under the DGP characterized by
$\theta$. Taking unconditional expectations of (10.29) yields the desired
result. Summing (10.29) over $t = 1, \ldots, n$ shows that
$E_\theta(g_i(y, \theta)) = 0$ for $i = 1, \ldots, k$, or, equivalently,
that $E_\theta(g(y, \theta)) = 0$.
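
The zero-mean property of the score contributions can be checked numerically.
The sketch below assumes an exponential density $\theta e^{-\theta y}$, for
which the contribution to the gradient is $1/\theta - y$, and approximates the
integral in (10.28) by a midpoint rule. The model, the truncation point, and
the step count are illustrative assumptions.

```python
import math

def score(y, theta):
    """Derivative of log(theta) - theta*y with respect to theta."""
    return 1.0 / theta - y

def expected_score(theta_eval, theta_dgp, upper=40.0, steps=200_000):
    """Midpoint-rule approximation to the integral of score * density on [0, upper].

    The score is evaluated at theta_eval; the density is that of the DGP
    with parameter theta_dgp.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        density = theta_dgp * math.exp(-theta_dgp * y)
        total += score(y, theta_eval) * density * h
    return total

at_truth = expected_score(2.0, 2.0)    # score evaluated at the DGP's own parameter
elsewhere = expected_score(3.0, 2.0)   # score evaluated at a different parameter
print(at_truth, elsewhere)
```

The first value should be essentially zero. The second should be close to
$1/3 - 1/2$, which shows that the result holds only when the score is evaluated
at the parameter value that characterizes the DGP.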


In addition to the conditional expectations of the elements of the matrix
$G(y, \theta)$, we can compute the covariances of these elements. Let
$t \ne s$, and suppose, without loss of generality, that $t < s$. Then the
covariance under the DGP characterized by $\theta$ of the $ti^{\rm th}$ and
$sj^{\rm th}$ elements of $G(y, \theta)$ is

$$\begin{aligned}
E_\theta\bigl(G_{ti}(y^t, \theta)\, G_{sj}(y^s, \theta)\bigr)
&= E_\theta\Bigl(E_\theta\bigl(G_{ti}(y^t, \theta)\, G_{sj}(y^s, \theta) \mid y^t\bigr)\Bigr) \\
&= E_\theta\Bigl(G_{ti}(y^t, \theta)\, E_\theta\bigl(G_{sj}(y^s, \theta) \mid y^t\bigr)\Bigr) = 0.
\end{aligned} \tag{10.30}$$

The step leading to the second line above follows because $G_{ti}(\cdot)$ is a
deterministic function of $y^t$, and the last step follows because the
expectation of $G_{sj}(\cdot)$ is zero conditional on $y^{s-1}$, by (10.29), and
so also conditional on the subvector $y^t$ of $y^{s-1}$. The above proof shows
that the covariance of the two matrix elements is also zero conditional on $y^t$.


The Information Matrix and the Hessian 


The covariance matrix of the elements of the $t^{\rm th}$ row $G_t(y^t, \theta)$
of $G(y, \theta)$ is the $k \times k$ matrix $I_t(\theta)$, of which the
$ij^{\rm th}$ element is $E_\theta(G_{ti}(y^t, \theta)\, G_{tj}(y^t, \theta))$.
As a covariance matrix, $I_t(\theta)$ is normally positive definite. The sum of
the matrices $I_t(\theta)$ over all t is the $k \times k$ matrix

$$I(\theta) \equiv \sum_{t=1}^{n} I_t(\theta) = \sum_{t=1}^{n} E_\theta\bigl(G_t^\top(y^t, \theta)\, G_t(y^t, \theta)\bigr), \tag{10.31}$$

which is called the information matrix. The matrices $I_t(\theta)$ are the
contributions to the information matrix made by the successive observations.


An equivalent definition of the information matrix, as readers are invited to
show in Exercise 10.5, is $I(\theta) = E_\theta(g(y, \theta)\, g^\top(y, \theta))$.
In this second form, the information matrix is the expectation of the outer
product of the gradient with itself; see Section 1.4 for the definition of the
outer product of two vectors. Less exotically, it is just the covariance matrix
of the score vector. As the name suggests, and as we will see shortly, the
information matrix is a measure of the total amount of information about the
parameters in the sample. The requirement that it should be positive definite
is a condition for strong asymptotic identification of those parameters, in the
same sense as the strong asymptotic identification condition introduced in
Section 6.2 for nonlinear regression models.


Closely related to (10.31) is the asymptotic information matrix

$$\mathcal{I}(\theta) \equiv \operatorname*{plim}_{n\to\infty}{}_{\theta}\; \frac{1}{n}\, I(\theta), \tag{10.32}$$

which measures the average amount of information about the parameters that
is contained in the observations of the sample. As with the notation
$E_\theta$, we use $\operatorname{plim}_\theta$ to denote the plim under the DGP
characterized by $\theta$.


We have already defined the Hessian $H(y, \theta)$. For asymptotic analysis, we
will generally be more interested in the asymptotic Hessian,

$$\mathcal{H}(\theta) \equiv \operatorname*{plim}_{n\to\infty}{}_{\theta}\; \frac{1}{n}\, H(y, \theta), \tag{10.33}$$

than in $H(y, \theta)$ itself. The asymptotic Hessian is related to the ordinary
Hessian in exactly the same way as the asymptotic information matrix is
related to the ordinary information matrix; compare (10.32) and (10.33).


There is a very important relationship between the asymptotic information
matrix and the asymptotic Hessian. One version of this relationship, which is
called the information matrix equality, is

$$\mathcal{I}(\theta) = -\mathcal{H}(\theta). \tag{10.34}$$

Both the Hessian and the information matrix measure the amount of curvature
in the loglikelihood function. Although they are both measuring the same
thing, the Hessian is negative definite, at least in the neighborhood of
$\hat\theta$, while the information matrix is always positive definite; that is
why there is a minus sign in (10.34). The proof of (10.34) is the subject of
Exercises 10.6 and 10.7. It depends critically on the assumption that the DGP
is a special case of the model being estimated.
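
A small numerical check may help to fix ideas. The sketch below uses a Poisson
model with mean $\lambda$, which is not discussed in the text and is assumed
here only because its expectations are simple sums. One contribution to the
loglikelihood is $y\log\lambda - \lambda - \log y!$, so the score is
$y/\lambda - 1$ and the second derivative is $-y/\lambda^2$. The code computes
the expected squared score and the expected second derivative under the same
$\lambda$, truncating the sum at a hypothetical upper limit `y_max`.

```python
import math

def poisson_pmf_table(lam, y_max):
    """Return [P(y=0), ..., P(y=y_max)] for a Poisson(lam) variable, built recursively."""
    probs = [math.exp(-lam)]
    for y in range(1, y_max + 1):
        probs.append(probs[-1] * lam / y)
    return probs

def information_and_hessian(lam, y_max=200):
    """Expected squared score and expected second derivative of one contribution."""
    probs = poisson_pmf_table(lam, y_max)
    info = sum(p * (y / lam - 1.0) ** 2 for y, p in enumerate(probs))
    hess = sum(p * (-y / lam ** 2) for y, p in enumerate(probs))
    return info, hess

info, hess = information_and_hessian(3.0)
print(info, hess)
```

For a correctly specified model the two numbers should be equal in magnitude
and opposite in sign, here $1/\lambda$ and $-1/\lambda$, in line with (10.34).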


Asymptotic Normality of the MLE 


In order for it to be asymptotically normally distributed, a maximum
likelihood estimator must be a Type 2 MLE. In addition, it must satisfy certain
regularity conditions, which are discussed in Davidson and MacKinnon (1993,
Section 8.5). The Type 2 requirement arises because the proof of asymptotic
normality is based on the likelihood equations (10.14), which apply only to
Type 2 estimators.


The first step in the proof is to perform a Taylor expansion of the likelihood
equations (10.14) around $\theta_0$. This expansion yields

$$g(\hat\theta) = g(\theta_0) + H(\bar\theta)(\hat\theta - \theta_0) = 0, \tag{10.35}$$

where we suppress the dependence on $y$ for notational simplicity. The notation
$\bar\theta$ is our usual shorthand notation for Taylor expansions of vector
expressions; see (6.20) and the subsequent discussion. We may therefore write

$$\|\bar\theta - \theta_0\| \le \|\hat\theta - \theta_0\|.$$

The fact that the ML estimator $\hat\theta$ is consistent then implies that
$\bar\theta$ is also consistent.


If we solve (10.35) and insert the factors of powers of n that are needed for
asymptotic analysis, we obtain the result that

$$n^{1/2}(\hat\theta - \theta_0) = -\bigl(n^{-1} H(\bar\theta)\bigr)^{-1} \bigl(n^{-1/2} g(\theta_0)\bigr). \tag{10.36}$$

Because $\bar\theta$ is consistent, the matrix $n^{-1}H(\bar\theta)$ which
appears in (10.36) must tend to the same nonstochastic limiting matrix as
$n^{-1}H(\theta_0)$, namely, $\mathcal{H}(\theta_0)$. Therefore, equation (10.36)
implies that

$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{=} -\mathcal{H}^{-1}(\theta_0)\, n^{-1/2} g(\theta_0). \tag{10.37}$$

If the information matrix equality, equation (10.34), holds, then this result
can equivalently be written as

$$n^{1/2}(\hat\theta - \theta_0) \stackrel{a}{=} \mathcal{I}^{-1}(\theta_0)\, n^{-1/2} g(\theta_0). \tag{10.38}$$

Since the information matrix equality holds only if the model is correctly
specified, (10.38) is not in general valid for misspecified models.


The asymptotic normality of the Type 2 MLE follows immediately from the
asymptotic equalities (10.37) or (10.38) if it can be shown that the vector
$n^{-1/2} g(\theta_0)$ is asymptotically distributed as multivariate normal. As
can be seen from (10.27), each element $n^{-1/2} g_i(\theta_0)$ of this vector
is $n^{-1/2}$ times a sum of n random variables, each of which has mean 0, by
(10.29). Under standard regularity conditions, with which we will not concern
ourselves, a multivariate central limit theorem can therefore be applied to
this vector. For finite n, the covariance matrix of the score vector is, by
definition, the information matrix $I(\theta_0)$. Thus the covariance matrix of
the vector $n^{-1/2} g(\theta_0)$ is $n^{-1} I(\theta_0)$, of which, by (10.32),
the limit as $n \to \infty$ is the asymptotic information matrix
$\mathcal{I}(\theta_0)$. It follows that

$$\operatorname*{plim}_{n\to\infty} \bigl(n^{-1/2} g(\theta_0)\bigr) \sim N\bigl(0, \mathcal{I}(\theta_0)\bigr). \tag{10.39}$$

This result, when combined with (10.37) or (10.38), implies that the Type 2
MLE is asymptotically normally distributed.
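
A Monte Carlo sketch can illustrate this result. It assumes an exponential
model with rate $\theta_0$, for which the asymptotic information is
$1/\theta_0^2$, so that the asymptotic variance of
$n^{1/2}(\hat\theta - \theta_0)$ should be $\theta_0^2$. The model, seed,
sample size, and number of replications are illustrative assumptions.

```python
import random
import statistics

random.seed(2024)                       # fixed seed for reproducibility
theta0, n, replications = 2.0, 400, 2000

scaled_errors = []
for _ in range(replications):
    sample = [random.expovariate(theta0) for _ in range(n)]
    theta_hat = n / sum(sample)         # MLE: reciprocal of the sample mean
    scaled_errors.append(n ** 0.5 * (theta_hat - theta0))

mc_mean = statistics.mean(scaled_errors)
mc_var = statistics.pvariance(scaled_errors)
asy_var = theta0 ** 2                   # inverse of the asymptotic information

print("simulated mean:", mc_mean)
print("simulated variance:", mc_var, "asymptotic variance:", asy_var)
```

The simulated variance should be close to, though not exactly equal to, the
asymptotic variance, since n is finite and the number of replications is
limited.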


10.4 The Covariance Matrix of the ML Estimator


For Type 2 ML estimators, we can obtain the asymptotic distribution of
the estimator by combining the result (10.39) for the asymptotic distribution
of $n^{-1/2} g(\theta_0)$ with the result (10.37). The asymptotic distribution of
the estimator is the distribution of the random variable
$\operatorname{plim} n^{1/2}(\hat\theta - \theta_0)$. This distribution is
normal, with mean vector zero and covariance matrix

$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\theta - \theta_0)\Bigr) = \mathcal{H}^{-1}(\theta_0)\, \mathcal{I}(\theta_0)\, \mathcal{H}^{-1}(\theta_0), \tag{10.40}$$

which has the form of a sandwich covariance matrix. When the information
matrix equality, equation (10.34), holds, the sandwich simplifies to

$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\theta - \theta_0)\Bigr) = \mathcal{I}^{-1}(\theta_0). \tag{10.41}$$

Thus the asymptotic information matrix is seen to be the asymptotic precision
matrix of a Type 2 ML estimator. This shows why the matrices $I$ and
$\mathcal{I}$ are called information matrices of various sorts.


Clearly, any method that allows us to estimate $\mathcal{I}(\theta_0)$
consistently can be used to estimate the covariance matrix of the ML estimates.
In fact, several different methods are widely used, because each has advantages
in certain situations.


The first method is just to use minus the inverse of the Hessian, evaluated at
the vector of ML estimates. Because these estimates are consistent, it is valid
to evaluate the Hessian at $\hat\theta$ rather than at $\theta_0$. This yields
the estimator

$$\widehat{\operatorname{Var}}_{\rm H}(\hat\theta) = -H^{-1}(\hat\theta), \tag{10.42}$$

which is referred to as the empirical Hessian estimator. Notice that, since it
is the covariance matrix of $\hat\theta$ in which we are interested, the factor
of $n^{1/2}$ is no longer present. This estimator is easy to obtain whenever
Newton’s Method, or some sort of quasi-Newton method that uses second
derivatives, is used to maximize the loglikelihood function. In the case of
quasi-Newton methods, $H(\hat\theta)$ may sometimes be replaced by another
matrix that approximates it. Provided that $n^{-1}$ times the approximating
matrix converges to $\mathcal{H}(\theta)$, this sort of replacement is
asymptotically valid.


Although the empirical Hessian estimator often works well, it does not use
all the information we have about the model. Especially for simpler models,
we may actually be able to find an analytic expression for $I(\theta)$. If so,
we can use the inverse of $I(\theta)$, evaluated at the ML estimates. This
yields the information matrix, or IM, estimator

$$\widehat{\operatorname{Var}}_{\rm IM}(\hat\theta) = I^{-1}(\hat\theta). \tag{10.43}$$

The advantage of this estimator is that it normally involves fewer random
terms than does the empirical Hessian, and it may therefore be somewhat
more efficient. In the case of the classical normal linear model, to be
discussed below, it is not at all difficult to obtain $I(\theta)$, and the
information matrix estimator is therefore the one that is normally used.


The third method is based on (10.31), from which we see that

$$I(\theta_0) = E\bigl(G^\top(\theta_0)\, G(\theta_0)\bigr).$$

We can therefore estimate $n^{-1} I(\theta_0)$ consistently by
$n^{-1} G^\top(\hat\theta)\, G(\hat\theta)$. The corresponding estimator of the
covariance matrix, which is usually called the outer-product-of-the-gradient,
or OPG, estimator, is

$$\widehat{\operatorname{Var}}_{\rm OPG}(\hat\theta) = \bigl(G^\top(\hat\theta)\, G(\hat\theta)\bigr)^{-1}. \tag{10.44}$$

The OPG estimator has the advantage of being very easy to calculate. Unlike
the empirical Hessian, it depends solely on first derivatives. Unlike the IM
estimator, it requires no theoretical calculations. However, it tends to be less
reliable in finite samples than either of the other two. The OPG estimator is
sometimes called the BHHH estimator, because it was advocated by Berndt,
Hall, Hall, and Hausman (1974) in a very well-known paper.


In practice, the estimators (10.42), (10.43), and (10.44) are all commonly used 
to estimate the covariance matrix of ML estimates, but many other estimators 
are available for particular models. Often, it may be difficult to obtain
$I(\theta)$, but not difficult to obtain another matrix that approximates it
asymptotically, by starting either from the matrix $-H(\theta)$ or from the
matrix $G^\top(\theta)\, G(\theta)$ and
taking expectations of some elements. 


A fourth covariance matrix estimator, which follows directly from (10.40), is
the sandwich estimator

$$\widehat{\operatorname{Var}}_{\rm S}(\hat\theta) = H^{-1}(\hat\theta)\, G^\top(\hat\theta)\, G(\hat\theta)\, H^{-1}(\hat\theta). \tag{10.45}$$

In normal circumstances, this estimator has little to recommend it. It is
harder to compute than the OPG estimator and can be just as unreliable in
finite samples. However, unlike the other three estimators, it will be valid
even when the information matrix equality does not hold. Since this equality
will generally fail to hold when the model is misspecified, it may be desirable
to compute (10.45) and compare it with the other estimators.
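
The four estimators (10.42) to (10.45) can be compared in a minimal scalar
sketch. It assumes an exponential density $\theta e^{-\theta y}$, so that
$\hat\theta = 1/\bar y$, the Hessian is $-n/\theta^2$, the information is
$n/\theta^2$, and the contributions to the gradient are $1/\theta - y_t$. The
model, the function name, and the four data points are hypothetical.

```python
def exponential_cov_estimators(sample):
    """Variance estimates for the exponential-rate MLE from (10.42)-(10.45)."""
    n = len(sample)
    theta_hat = n / sum(sample)
    hessian = -n / theta_hat ** 2            # second derivative of the loglikelihood
    info = n / theta_hat ** 2                # analytic information, at theta_hat
    contributions = [1.0 / theta_hat - y for y in sample]   # the column of G
    opg = sum(g * g for g in contributions)                 # scalar G'G
    return {
        "theta_hat": theta_hat,
        "hessian": -1.0 / hessian,           # empirical Hessian estimator (10.42)
        "im": 1.0 / info,                    # information matrix estimator (10.43)
        "opg": 1.0 / opg,                    # OPG estimator (10.44)
        "sandwich": opg / hessian ** 2,      # sandwich estimator (10.45)
    }

est = exponential_cov_estimators([0.5, 1.0, 1.5, 2.0])
for name, value in est.items():
    print(name, value)
```

In this particular model the Hessian does not depend on the data except through
$\hat\theta$, so the empirical Hessian and IM estimators coincide. With only
four observations, the OPG and sandwich estimates differ noticeably from them,
which is a reminder that the equivalence of these estimators is asymptotic.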


When an ML estimator is applied to a model which is misspecified in ways 
that do not affect the consistency of the estimator, it is said to be a quasi- 
ML estimator, or QMLE; see White (1982) and Gouriéroux, Monfort, and 
Trognon (1984). In general, the sandwich covariance matrix estimator (10.45) 
is valid for QML estimators, but the other covariance matrix estimators, which 
depend on the information matrix equality, are not valid. At least, they are 
not valid for all the parameters. We have seen that the ML estimator for a
regression model with normal errors is just the OLS estimator. But we know 
that the latter is consistent under conditions which do not require normality. 
If the error terms are not normal, therefore, the ML estimator is a QMLE. 
One consequence of this fact is explored in Exercise 10.8. 


The Classical Normal Linear Model 


It should help to make the theoretical results just discussed clearer if we apply 
them to the classical normal linear model. We will therefore discuss various 
ways of estimating the covariance matrix of the ML estimates $\hat\beta$ and
$\hat\sigma$ jointly. Of course, we saw in Section 3.4 how to estimate the
covariance matrix of $\hat\beta$ by itself, but we have not yet discussed how to
estimate the variance of $\hat\sigma$.


For the classical normal linear model, the contribution to the loglikelihood
function made by the $t^{\rm th}$ observation is given by expression (10.09).
There are $k + 1$ parameters. The first k of them are the elements of the vector
$\beta$, and the last one is $\sigma$. A typical element of any of the first k
columns of the matrix $G$, indexed by i, is

$$G_{ti}(\beta, \sigma) = \frac{\partial \ell_t}{\partial \beta_i} = \frac{1}{\sigma^2}\,(y_t - X_t\beta)\, X_{ti}, \qquad i = 1, \ldots, k, \tag{10.46}$$

and a typical element of the last column is

$$G_{t,k+1}(\beta, \sigma) = \frac{\partial \ell_t}{\partial \sigma} = -\frac{1}{\sigma} + \frac{1}{\sigma^3}\,(y_t - X_t\beta)^2. \tag{10.47}$$

These two equations give us everything we need to calculate the information
matrix.


For $i, j = 1, \ldots, k$, the $ij^{\rm th}$ element of $G^\top G$ is

$$\sum_{t=1}^{n} \frac{1}{\sigma^4}\,(y_t - X_t\beta)^2\, X_{ti} X_{tj}. \tag{10.48}$$

This is just the sum over all t of $G_{ti}(\beta, \sigma)$ times
$G_{tj}(\beta, \sigma)$ as defined in (10.46). When we evaluate at the true
values of $\beta$ and $\sigma$, we have that $y_t - X_t\beta = u_t$ and
$E(u_t^2) = \sigma^2$, and so the expectation of this matrix element is easily
seen to be

$$\sum_{t=1}^{n} \frac{1}{\sigma^2}\, X_{ti} X_{tj}. \tag{10.49}$$

In matrix notation, the whole $\beta$-$\beta$ block of $G^\top G$ has
expectation $X^\top X/\sigma^2$. The $(i, k+1)^{\rm th}$ element of
$G^\top G$ is

$$\begin{aligned}
&\sum_{t=1}^{n} \Bigl(-\frac{1}{\sigma} + \frac{1}{\sigma^3}\,(y_t - X_t\beta)^2\Bigr) \Bigl(\frac{1}{\sigma^2}\,(y_t - X_t\beta)\, X_{ti}\Bigr) \\
&\qquad = -\sum_{t=1}^{n} \frac{1}{\sigma^3}\,(y_t - X_t\beta)\, X_{ti} + \sum_{t=1}^{n} \frac{1}{\sigma^5}\,(y_t - X_t\beta)^3\, X_{ti}.
\end{aligned} \tag{10.50}$$

This is the sum over all t of the product of expressions (10.46) and (10.47).
We know that $E(u_t) = 0$, and, if the error terms $u_t$ are normal, we also
know that $E(u_t^3) = 0$. Consequently, the expectation of this sum is 0. This
result depends critically on the assumption, following from normality, that
the distribution of the error terms is symmetric around zero. For a skewed
distribution, the third moment would be nonzero, and (10.50) would therefore
not have mean 0.


Finally, the $(k+1), (k+1)^{\rm th}$ element of $G^\top G$ is

$$\begin{aligned}
&\sum_{t=1}^{n} \Bigl(-\frac{1}{\sigma} + \frac{1}{\sigma^3}\,(y_t - X_t\beta)^2\Bigr)^{2} \\
&\qquad = \frac{n}{\sigma^2} - \sum_{t=1}^{n} \frac{2}{\sigma^4}\,(y_t - X_t\beta)^2 + \sum_{t=1}^{n} \frac{1}{\sigma^6}\,(y_t - X_t\beta)^4.
\end{aligned} \tag{10.51}$$

This is the sum over all t of the square of expression (10.47). To compute its
expectation, we replace $y_t - X_t\beta$ by $u_t$ and use the result that
$E(u_t^4) = 3\sigma^4$; see Exercise 4.2. It is then not hard to see that
expression (10.51) has expectation $2n/\sigma^2$. Once more, this result depends
crucially on the normality assumption. If the kurtosis of the error terms were
greater (or less) than that of the normal distribution, the expectation of
expression (10.51) would be larger (or smaller) than $2n/\sigma^2$.
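
For readers who want the intermediate step, the expectation of (10.51) can be
worked out term by term, using only $E(u_t^2) = \sigma^2$ and
$E(u_t^4) = 3\sigma^4$ as stated above:

```latex
\begin{aligned}
E\Bigl(\frac{n}{\sigma^2} - \sum_{t=1}^{n} \frac{2}{\sigma^4}\,u_t^2
      + \sum_{t=1}^{n} \frac{1}{\sigma^6}\,u_t^4\Bigr)
&= \frac{n}{\sigma^2} - \frac{2}{\sigma^4}\,n\sigma^2 + \frac{1}{\sigma^6}\,3n\sigma^4 \\
&= (1 - 2 + 3)\,\frac{n}{\sigma^2} = \frac{2n}{\sigma^2}.
\end{aligned}
```

If the fourth moment were $\kappa\sigma^4$ instead of $3\sigma^4$, where
$\kappa$ is a symbol introduced here for illustration, the same calculation
would give $(\kappa - 1)n/\sigma^2$, which exceeds $2n/\sigma^2$ exactly when
$\kappa > 3$.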


Putting the results (10.49), (10.50), and (10.51) together, the asymptotic
information matrix for $\beta$ and $\sigma$ jointly is seen to be

$$\mathcal{I}(\beta, \sigma) = \operatorname*{plim}_{n\to\infty} \begin{bmatrix} n^{-1} X^\top X/\sigma^2 & 0 \\ 0^\top & 2/\sigma^2 \end{bmatrix}. \tag{10.52}$$

Inverting this matrix, multiplying the inverse by $n^{-1}$, and replacing
$\sigma$ by $\hat\sigma$, we find that the IM estimator of the covariance matrix
of all the parameter estimates is

$$\widehat{\operatorname{Var}}_{\rm IM}(\hat\beta, \hat\sigma) = \begin{bmatrix} \hat\sigma^2 (X^\top X)^{-1} & 0 \\ 0^\top & \hat\sigma^2/2n \end{bmatrix}. \tag{10.53}$$

The upper left-hand block of this matrix would be the familiar OLS covariance
matrix if we had used $s$ instead of $\hat\sigma$ to estimate $\sigma$. The
lower right-hand element is the approximate variance of $\hat\sigma$, under the
assumption of normally distributed error terms.
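
As a sketch of how (10.53) might be computed, the following code handles the
special case of a regression on a constant and one regressor, so that
$(X^\top X)^{-1}$ can be written out by hand. The function name and the data
are hypothetical; a general implementation would use a matrix library instead.

```python
def simple_regression_ml(x, y):
    """ML estimates and the IM covariance estimator (10.53) for y = b1 + b2*x + u."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta2 = s_xy / s_xx
    beta1 = y_bar - beta2 * x_bar
    ssr = sum((yi - beta1 - beta2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ssr / n                      # ML estimate: divides by n, not n - k
    # Inverse of X'X for X = [1, x], written out for the 2 x 2 case.
    sum_x2 = sum(xi * xi for xi in x)
    xtx_inv = [[sum_x2 / (n * s_xx), -x_bar / s_xx],
               [-x_bar / s_xx, 1.0 / s_xx]]
    cov_beta = [[sigma2_hat * v for v in row] for row in xtx_inv]   # upper-left block
    var_sigma = sigma2_hat / (2 * n)                                # lower-right element
    return (beta1, beta2), sigma2_hat, cov_beta, var_sigma

beta_hat, sigma2_hat, cov_beta, var_sigma = simple_regression_ml(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 5.0])
print(beta_hat, sigma2_hat, cov_beta, var_sigma)
```

Because the matrix is block-diagonal, the two blocks are computed separately
and no joint inversion is needed.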


It is noteworthy that the information matrix (10.52), and therefore also the
estimated covariance matrix (10.53), are block-diagonal. This implies that
there is no covariance between $\hat\beta$ and $\hat\sigma$. This is a property
of all regression models, nonlinear as well as linear, and it is responsible
for much of the simplicity of these models. The block-diagonality of the
information matrix means that we can make inferences about $\beta$ without
taking account of the fact that $\sigma$ has also been estimated, and we can
make inferences about $\sigma$ without taking account of the fact that $\beta$
has also been estimated. If the information matrix were not block-diagonal,
which in most other cases it is not, it would have been necessary to invert the
entire matrix in order to obtain any block of the inverse.


Asymptotic Efficiency of the ML Estimator 


A Type 2 ML estimator must be at least as asymptotically efficient as any 
other root-n consistent estimator that is asymptotically unbiased.⁴ Therefore,
at least in large samples, maximum likelihood estimation possesses an
optimality property that is generally not shared by other estimation methods. 
We will not attempt to prove this result here; see Davidson and MacKinnon 
(1993, Section 8.8). However, we will discuss it briefly. 


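Consider any other root-n consistent and asymptotically unbiased estimator,
say $\tilde\theta$. It can be shown that

$$\operatorname*{plim}_{n\to\infty} n^{1/2}(\tilde\theta - \theta_0) = \operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\theta - \theta_0) + v, \tag{10.54}$$

where $v$ is a random k-vector that has mean zero and is uncorrelated with
the vector $\operatorname{plim} n^{1/2}(\hat\theta - \theta_0)$. This means
that, from (10.54), we have

$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\tilde\theta - \theta_0)\Bigr) = \operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\theta - \theta_0)\Bigr) + \operatorname{Var}(v). \tag{10.55}$$

Since $\operatorname{Var}(v)$ must be a positive semidefinite matrix, we
conclude that the asymptotic covariance matrix of the estimator $\tilde\theta$
must be larger than that of $\hat\theta$, in the usual sense.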


The asymptotic equality (10.54) bears a strong, and by no means coincidental, 
resemblance to a result that we used in Section 3.5 when proving the Gauss- 
Markov Theorem. This result says that, in the context of the linear regression 
model, any unbiased linear estimator can be written as the sum of the OLS 
estimator and a random component which has mean zero and is uncorrelated 
with the OLS estimator. Asymptotically, equation (10.54) says essentially the 
same thing in the context of a very much broader class of models. The key 
property of (10.54) is that $v$ is uncorrelated with $\operatorname{plim} n^{1/2}(\hat\theta - \theta_0)$. Therefore,
v simply adds additional noise to the ML estimator. 


The asymptotic efficiency result (10.55) is really an asymptotic version of the
Cramér-Rao lower bound,⁵ which actually applies to any unbiased estimator,
regardless of sample size. It states that the covariance matrix of such an
estimator can never be smaller than $I^{-1}$, which, as we have seen, is
asymptotically equal to the covariance matrix of the ML estimator. Readers are
guided through the proof of this classical result in Exercise 10.12. However,
since ML estimators are not in general unbiased, it is only the asymptotic
version of the bound that is of interest in the context of ML estimation.


4 All of the root-n consistent estimators that we have discussed are also
asymptotically unbiased. However, as is discussed in Davidson and MacKinnon
(1993, Section 4.5), it is possible for such an estimator to be asymptotically
biased, and we must therefore rule out this possibility explicitly.


5 This bound was originally suggested by Fisher (1925) and later stated in its
modern form by Cramér (1946) and Rao (1945).


The fact that ML estimators attain the Cramér-Rao lower bound asymptotic- 
ally is one of their many attractive features. However, like the Gauss-Markov 
Theorem, this result must be interpreted with caution. First of all, it is only 
true asymptotically. ML estimators may or may not perform well in samples 
of moderate size. Secondly, there may well exist an asymptotically biased 
estimator that is more efficient, in the sense of finite-sample mean squared 
error, than any given ML estimator. For example, the estimator obtained 
by imposing a restriction that is false, but not grossly incompatible with the 
data, may well be more efficient than the unrestricted ML estimator. The 
former cannot be more efficient asymptotically, because the variance of both 
estimators tends to zero as the sample size tends to infinity and the bias of 
the biased estimator does not, but it can be more efficient in finite samples. 


10.5 Hypothesis Testing 


Maximum likelihood estimation offers three different procedures for perform- 
ing hypothesis tests, two of which usually have several different variants. 
These three procedures, which are collectively referred to as the three classical 
tests, are the likelihood ratio, Wald, and Lagrange multiplier tests. All three 
tests are asymptotically equivalent, in the sense that all the test statistics 
tend to the same random variable (under the null hypothesis, and for DGPs 
that are “close” to the null hypothesis) as the sample size tends to infinity. 
If the number of equality restrictions is r, this limiting random variable is 
distributed as $\chi^2(r)$. We have already discussed Wald tests in Sections 6.7
and 8.5, but we have not yet encountered the other two classical tests, at 
least, not under their usual names. 


As we remarked in Section 4.6, a hypothesis in econometrics corresponds to
a model. We let the model that corresponds to the alternative hypothesis
be characterized by the loglikelihood function $\ell(\theta)$. Then the null
hypothesis imposes r restrictions, which are in general nonlinear, on
$\theta$. We write these as $r(\theta) = 0$, where $r(\theta)$ is an r-vector of
smooth functions of the parameters. Thus the null hypothesis is represented by
the model with loglikelihood $\ell(\theta)$, where the parameter space is
restricted to those values of $\theta$ that satisfy the restrictions
$r(\theta) = 0$.


Likelihood Ratio Tests 


The likelihood ratio, or LR, test is the simplest of the three classical tests.
The test statistic is just twice the difference between the unconstrained
maximum value of the loglikelihood function and the maximum subject to the
restrictions:

$$LR = 2\bigl(\ell(\hat\theta) - \ell(\tilde\theta)\bigr). \tag{10.56}$$

Here $\tilde\theta$ and $\hat\theta$ denote, respectively, the restricted and
unrestricted maximum likelihood estimates of $\theta$. The LR statistic gets its
name from the fact that the right-hand side of (10.56) is equal to

$$2 \log\left(\frac{L(\hat\theta)}{L(\tilde\theta)}\right),$$

or twice the logarithm of the ratio of the likelihood functions. One of its
most attractive features is that the LR statistic is trivially easy to compute
when both the restricted and unrestricted estimates are available. Whenever
we impose, or relax, some restrictions on a model, twice the change in the
value of the loglikelihood function provides immediate feedback on whether
the restrictions are compatible with the data.
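
The computation in (10.56) can be sketched for a scalar example. The code
assumes an exponential density $\theta e^{-\theta y}$, with loglikelihood
$n\log\theta - \theta\sum_t y_t$, and tests the single restriction that
$\theta$ equals a hypothesized value. The model, the function names, and the
data are illustrative assumptions.

```python
import math

def exp_loglik(theta, sample):
    """Loglikelihood of an exponential sample with rate theta."""
    return len(sample) * math.log(theta) - theta * sum(sample)

def lr_statistic(sample, theta_null):
    """LR statistic (10.56) for the restriction theta = theta_null."""
    theta_hat = len(sample) / sum(sample)   # unrestricted MLE
    # The restricted "estimate" is simply theta_null, since it has no free parameters.
    return 2.0 * (exp_loglik(theta_hat, sample) - exp_loglik(theta_null, sample))

sample = [0.5, 1.0, 1.5, 2.0]
lr = lr_statistic(sample, 1.0)
print("LR statistic:", lr)
```

The statistic cannot be negative, because the unrestricted maximum is at least
as large as the restricted one. With one restriction, it would be compared with
a critical value from the $\chi^2(1)$ distribution.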


Precisely why the LR statistic is asymptotically distributed as $\chi^2(r)$ is
not entirely obvious, and we will not attempt to explain it now. The asymptotic
theory of the three classical tests will be discussed in detail in the next
section. Some intuition can be gained by looking at the LR test for linear
restrictions on the classical normal linear model. The LR statistic turns out
to be closely related to the familiar F statistic, which can be written as

$$F = \frac{\bigl(\mathrm{SSR}(\tilde\beta) - \mathrm{SSR}(\hat\beta)\bigr)/r}{\mathrm{SSR}(\hat\beta)/(n - k)}, \tag{10.57}$$

where $\hat\beta$ and $\tilde\beta$ are the unrestricted and restricted OLS (and
hence also ML) estimators, respectively. The LR statistic can also be expressed
in terms of the two sums of squared residuals, by use of the formula (10.12),
which gives the maximized loglikelihood in terms of the minimized SSR. The
statistic is

$$\begin{aligned}
2\bigl(\ell(\hat\theta) - \ell(\tilde\theta)\bigr)
&= 2\Bigl(\frac{n}{2} \log \mathrm{SSR}(\tilde\beta) - \frac{n}{2} \log \mathrm{SSR}(\hat\beta)\Bigr) \\
&= n \log\left(\frac{\mathrm{SSR}(\tilde\beta)}{\mathrm{SSR}(\hat\beta)}\right).
\end{aligned} \tag{10.58}$$

We can rewrite the last expression here as

$$n \log\left(1 + \frac{\mathrm{SSR}(\tilde\beta) - \mathrm{SSR}(\hat\beta)}{\mathrm{SSR}(\hat\beta)}\right) = n \log\left(1 + \frac{r}{n - k}\, F\right) \cong rF.$$

The approximate equality above follows from the facts that
$n/(n - k) \stackrel{a}{=} 1$ and that $\log(1 + a) \cong a$ whenever $a$ is
small. Under the null hypothesis, $\mathrm{SSR}(\tilde\beta)$ should not be much
larger than $\mathrm{SSR}(\hat\beta)$, or, equivalently, $F/(n - k)$ should be
a small quantity, and so this approximation should generally be a good one.
We may therefore conclude that the LR statistic (10.58) is asymptotically
equal to r times the F statistic. Whether or not this is so, the LR statistic is
a deterministic, strictly increasing, function of the F statistic. As we will see
later, this fact has important consequences if the statistics are bootstrapped.
Without bootstrapping, it makes little sense to use an LR test rather than
the F test in the context of the classical normal linear model, because the
latter, but not the former, is exact in finite samples.
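
The relationship between the two statistics is easy to verify numerically. The
sketch below computes F from (10.57) and LR from (10.58) for hypothetical sums
of squared residuals, and also evaluates the intermediate expression
$n\log(1 + rF/(n-k))$. All numbers and function names are illustrative.

```python
import math

def f_statistic(ssr_r, ssr_u, n, k, r):
    """F statistic (10.57) from restricted and unrestricted SSRs."""
    return ((ssr_r - ssr_u) / r) / (ssr_u / (n - k))

def lr_from_ssr(ssr_r, ssr_u, n):
    """LR statistic (10.58): n times the log of the ratio of the SSRs."""
    return n * math.log(ssr_r / ssr_u)

def lr_from_f(f_stat, n, k, r):
    """The same LR statistic written as a strictly increasing function of F."""
    return n * math.log(1.0 + r * f_stat / (n - k))

# Hypothetical values: 200 observations, 5 regressors, 2 restrictions.
n, k, r = 200, 5, 2
ssr_r, ssr_u = 105.0, 100.0

f_stat = f_statistic(ssr_r, ssr_u, n, k, r)
lr = lr_from_ssr(ssr_r, ssr_u, n)
print("F =", f_stat, " r*F =", r * f_stat, " LR =", lr)
```

Here the restricted SSR is only 5% larger than the unrestricted one, so
$F/(n-k)$ is small and LR is very close to $rF$.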


Wald Tests 


Unlike LR tests, Wald tests depend only on the estimates of the unrestricted 
model. There is no real difference between Wald tests in models estimated 
by maximum likelihood and those in models estimated by other methods; see 
Sections 6.7 and 8.5. As with the LR test, we wish to test the $r$ restrictions
$\boldsymbol{r}(\theta) = 0$. The Wald test statistic is just a quadratic form in the vector $\boldsymbol{r}(\hat\theta)$
and the inverse of a matrix that estimates its covariance matrix.


By using the delta method (Section 5.6), we find that 


$$\mathrm{Var}\bigl(\boldsymbol{r}(\hat\theta)\bigr) \stackrel{a}{=} R(\theta_0)\,\mathrm{Var}(\hat\theta)\,R^\top(\theta_0), \tag{10.59}$$


where $R(\theta)$ is an $r \times k$ matrix with typical element $\partial r_j(\theta)/\partial\theta_i$. In the last
section, we saw that $\mathrm{Var}(\hat\theta)$ can be estimated in several ways. Substituting
any of these estimators, denoted $\widehat{\mathrm{Var}}(\hat\theta)$, for $\mathrm{Var}(\hat\theta)$ in (10.59) and replacing
the unknown $\theta_0$ by $\hat\theta$, we find that the Wald statistic is


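$$W = \boldsymbol{r}^\top(\hat\theta)\bigl(R(\hat\theta)\,\widehat{\mathrm{Var}}(\hat\theta)\,R^\top(\hat\theta)\bigr)^{-1}\boldsymbol{r}(\hat\theta). \tag{10.60}$$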


This is a quadratic form in the $r$-vector $\boldsymbol{r}(\hat\theta)$, which is asymptotically multivariate
normal, and the inverse of an estimate of its covariance matrix. It is
easy to see, using the first part of Theorem 4.1, that (10.60) is asymptotically
distributed as $\chi^2(r)$ under the null hypothesis. As readers are asked to show
in Exercise 10.13, the Wald statistic (6.71) is just a special case of (10.60). In 
addition, in the case of linear regression models subject to linear restrictions 
on the parameters, the Wald statistic (10.60) is, like the LR statistic, a de- 
terministic, strictly increasing, function of the F statistic if the information 
matrix estimator (10.43) of the covariance matrix of the parameters is used 
to construct the Wald statistic. 


Wald tests are very widely used, in part because the square of every t statistic 
is really a Wald statistic. Nevertheless, they should be used with caution. 
Although Wald tests do not necessarily have poor finite-sample properties, 
and they do not necessarily perform less well in finite samples than the other 
classical tests, there is a good deal of evidence that they quite often do so. 
One reason for this is that Wald statistics are not invariant to reformulations 
of the restrictions. Some formulations may lead to Wald tests that are well-behaved,
but others may lead to tests that severely overreject, or (much less
commonly) underreject, in samples of moderate size. 


As an example, consider the linear regression model 
$$y_t = \beta_0 + \beta_1 X_{t1} + \beta_2 X_{t2} + u_t, \tag{10.61}$$


where we wish to test the hypothesis that the product of $\beta_1$ and $\beta_2$ is 1. To
compute a Wald statistic, we need to estimate the covariance matrix of $\hat\beta_1$
and $\hat\beta_2$. If $X$ denotes the $n \times 2$ matrix with typical element $X_{ti}$, and $M_\iota$ is
the matrix that takes deviations from the mean, then the IM estimator of this
covariance matrix is


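$$\widehat{\mathrm{Var}}(\hat\beta_1, \hat\beta_2) = \hat\sigma^2 (X^\top M_\iota X)^{-1}; \tag{10.62}$$


we could of course use $s^2$ instead of $\hat\sigma^2$. For notational convenience, we will
let $V_{11}$, $V_{12}$ ($= V_{21}$), and $V_{22}$ denote the three distinct elements of this matrix.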
There are many ways to write the single restriction on (10.61) that we wish 
to test. Three that seem particularly natural are 

$$r_1(\beta_1, \beta_2) \equiv \beta_1 - 1/\beta_2 = 0,$$

$$r_2(\beta_1, \beta_2) \equiv \beta_2 - 1/\beta_1 = 0, \text{ and}$$

$$r_3(\beta_1, \beta_2) \equiv \beta_1\beta_2 - 1 = 0.$$

Each of these ways of writing the restriction leads to a different Wald statistic. 
If the restriction is written in the form of $r_1$, then $R(\beta_1, \beta_2) = [1 \;\; 1/\beta_2^2]$.
Combining this with (10.62), we find after a little algebra that the Wald
statistic is

$$W_1 = \frac{(\hat\beta_1 - 1/\hat\beta_2)^2}{V_{11} + 2V_{12}/\hat\beta_2^2 + V_{22}/\hat\beta_2^4}.$$


If instead the restriction is written in the form of $r_2$, then $R(\beta_1, \beta_2) = [1/\beta_1^2 \;\; 1]$,
and the Wald statistic is

$$W_2 = \frac{(\hat\beta_2 - 1/\hat\beta_1)^2}{V_{11}/\hat\beta_1^4 + 2V_{12}/\hat\beta_1^2 + V_{22}}.$$


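Finally, if the restriction is written in the form of $r_3$, then $R(\beta_1, \beta_2) = [\beta_2 \;\; \beta_1]$,
and the Wald statistic is

$$W_3 = \frac{(\hat\beta_1\hat\beta_2 - 1)^2}{\hat\beta_2^2 V_{11} + 2\hat\beta_1\hat\beta_2 V_{12} + \hat\beta_1^2 V_{22}}.$$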


In finite samples, these three Wald statistics can be quite different. Depending 
on the values of $\beta_1$ and $\beta_2$, any one of them may perform better or worse than
the other two, and they can sometimes overreject severely. The performance of
alternative Wald tests in models like (10.61) has been investigated by Gregory 
and Veall (1985, 1987). Other cases in which Wald tests perform very badly 
are discussed by Lafontaine and White (1986). 
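
To make the lack of invariance concrete, the three statistics can be evaluated at one set of estimates. The sketch below is illustrative only: the function name `wald_three_ways` and the numerical values of the estimates and of $V_{11}$, $V_{12}$, $V_{22}$ are hypothetical and are not taken from any data set.

```python
def wald_three_ways(b1, b2, v11, v12, v22):
    """Wald statistics for beta1 * beta2 = 1 written as r1, r2 and r3."""
    # r1: beta1 - 1/beta2 = 0, with R = [1, 1/beta2^2]
    w1 = (b1 - 1.0 / b2) ** 2 / (v11 + 2.0 * v12 / b2 ** 2 + v22 / b2 ** 4)
    # r2: beta2 - 1/beta1 = 0, with R = [1/beta1^2, 1]
    w2 = (b2 - 1.0 / b1) ** 2 / (v11 / b1 ** 4 + 2.0 * v12 / b1 ** 2 + v22)
    # r3: beta1 * beta2 - 1 = 0, with R = [beta2, beta1]
    w3 = (b1 * b2 - 1.0) ** 2 / (
        b2 ** 2 * v11 + 2.0 * b1 * b2 * v12 + b1 ** 2 * v22
    )
    return w1, w2, w3


# Hypothetical estimates and covariance matrix elements.
w1, w2, w3 = wald_three_ways(b1=2.0, b2=0.8, v11=0.09, v12=0.01, v22=0.04)
print(w1, w2, w3)
```

All three statistics test the same hypothesis with the same estimates, yet they take three different values. They coincide (all being zero) only when the estimates satisfy the restriction exactly.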


Because of their dubious finite-sample properties and their sensitivity to the 
way in which the restrictions are written, we recommend against using Wald 
tests when the outcome of a test is important, except when it would be very 
costly or inconvenient to estimate the restricted model. Asymptotic t statistics 
should also be used with great caution, since, as we saw in Section 6.7, every 
asymptotic t statistic is simply the signed square root of a Wald statistic. 
Because conventional confidence intervals are based on inverting asymptotic 
t statistics, they too should be used with caution. 


Lagrange Multiplier Tests 


The Lagrange multiplier, or LM, test is the third of the three classical tests. 
The name suggests that it is based on the vector of Lagrange multipliers from 
a constrained maximization problem. That can indeed be the case. In prac- 
tice, however, LM tests are very rarely computed in this way. Instead, they 
are usually based on the gradient vector, or score vector, of the unrestricted 
loglikelihood function, evaluated at the restricted estimates. LM tests are 
very often computed by means of artificial regressions. In fact, as we will see, 
some of the GNR-based tests that we encountered in Sections 6.7 and 7.7 are 
essentially Lagrange multiplier tests. 


It is easiest to begin our discussion of LM tests by considering the case in 
which the restrictions to be tested are zero restrictions, that is, restrictions 
according to which some of the model parameters are zero. In such cases, 
the $r$ restrictions can be written as $\theta_2 = 0$, where the parameter vector $\theta$ is
partitioned as $\theta = [\theta_1 \;\vdots\; \theta_2]$, possibly after some reordering of the elements.
The vector $\tilde\theta$ of restricted estimates can then be expressed as $\tilde\theta = [\tilde\theta_1 \;\vdots\; 0]$.
The vector $\tilde\theta_1$ maximizes the restricted loglikelihood function $\ell(\theta_1, 0)$, and so
it satisfies the restricted likelihood equations 


$$g_1(\tilde\theta_1, 0) = 0, \tag{10.63}$$


where $g_1(\cdot)$ is the vector whose components are the $k - r$ partial derivatives
of $\ell(\cdot)$ with respect to the elements of $\theta_1$.


The formula (10.38), which gives the asymptotic form of an MLE, can be 
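applied to the estimator $\tilde\theta_1$. If we partition the true parameter vector $\theta_0$ as
$[\theta_1^0 \;\vdots\; 0]$, we find that


$$n^{1/2}(\tilde\theta_1 - \theta_1^0) \stackrel{a}{=} (\mathcal{I}_{11})^{-1}(\theta_0)\, n^{-1/2} g_1(\theta_0), \tag{10.64}$$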


where $\mathcal{I}_{11}(\cdot)$ is the $(k-r) \times (k-r)$ top left block of the asymptotic information
matrix $\mathcal{I}(\cdot)$ of the full unrestricted model. This block is, of course, just the
asymptotic information matrix for the restricted model.

When the gradient vector of the unrestricted loglikelihood function is evaluated
at the restricted estimates $\tilde\theta$, the first $k - r$ elements, which are the
elements of $g_1(\tilde\theta)$, are zero, by (10.63). However, the $r$-vector $g_2(\tilde\theta)$, which
contains the remaining $r$ elements, is in general nonzero. In fact, a Taylor
expansion gives

$$n^{-1/2} g_2(\tilde\theta) = n^{-1/2} g_2(\theta_0) + n^{-1} H_{21}(\bar\theta)\, n^{1/2}(\tilde\theta_1 - \theta_1^0), \tag{10.65}$$


where our usual shorthand notation $\bar\theta$ is used for a vector that tends to $\theta_0$ as
$n \to \infty$, and $H_{21}(\cdot)$ is the lower left block of the Hessian of the loglikelihood.
The information matrix equality (10.34) shows that the limit of (10.65) for a 
correctly specified model is 


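$$\begin{aligned}
\operatorname*{plim}_{n\to\infty} n^{-1/2} g_2(\tilde\theta)
&= \operatorname*{plim}_{n\to\infty} n^{-1/2} g_2(\theta_0) - \mathcal{I}^0_{21} \operatorname*{plim}_{n\to\infty} n^{1/2}(\tilde\theta_1 - \theta_1^0) \\
&= \operatorname*{plim}_{n\to\infty} \bigl( n^{-1/2} g_2(\theta_0) - \mathcal{I}^0_{21} (\mathcal{I}^0_{11})^{-1} n^{-1/2} g_1(\theta_0) \bigr) \\
&= \bigl[ -\mathcal{I}^0_{21} (\mathcal{I}^0_{11})^{-1} \;\;\; \mathbf{I} \bigr] \operatorname*{plim}_{n\to\infty} \begin{bmatrix} n^{-1/2} g_1(\theta_0) \\ n^{-1/2} g_2(\theta_0) \end{bmatrix},
\end{aligned} \tag{10.66}$$


where $\mathcal{I}^0 \equiv \mathcal{I}(\theta_0)$, $\mathbf{I}$ is an $r \times r$ identity matrix, and the second line follows
from (10.64).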


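Since the variance of the full gradient vector, $\operatorname{plim} n^{-1/2} g(\theta_0)$, is just $\mathcal{I}^0$, the
variance of the last expression in (10.66) is


$$\begin{aligned}
\mathrm{Var}\bigl(\operatorname*{plim}_{n\to\infty} n^{-1/2} g_2(\tilde\theta)\bigr)
&= \bigl[ -\mathcal{I}^0_{21} (\mathcal{I}^0_{11})^{-1} \;\;\; \mathbf{I} \bigr]
\begin{bmatrix} \mathcal{I}^0_{11} & \mathcal{I}^0_{12} \\ \mathcal{I}^0_{21} & \mathcal{I}^0_{22} \end{bmatrix}
\begin{bmatrix} -(\mathcal{I}^0_{11})^{-1} \mathcal{I}^0_{12} \\ \mathbf{I} \end{bmatrix} \\
&= \mathcal{I}^0_{22} - \mathcal{I}^0_{21} (\mathcal{I}^0_{11})^{-1} \mathcal{I}^0_{12}.
\end{aligned} \tag{10.67}$$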


In Exercise 7.11, expressions were developed for the blocks of the inverses of 
partitioned matrices. It is easy to see from those expressions that the inverse 
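of (10.67) is the 22 block of $\mathcal{I}^{-1}(\theta_0)$. Thus, in order to obtain a statistic in
asymptotically $\chi^2$ form based on $g_2(\tilde\theta)$, we can construct the quadratic form


$$\mathrm{LM} = n^{-1/2} g_2^\top(\tilde\theta)\, (\tilde{\mathcal{I}}^{-1})_{22}\, n^{-1/2} g_2(\tilde\theta) = g_2^\top(\tilde\theta)\, (\tilde{\boldsymbol{I}}^{-1})_{22}\, g_2(\tilde\theta), \tag{10.68}$$


in which $\tilde{\mathcal{I}} \equiv n^{-1} \boldsymbol{I}(\tilde\theta)$ and $\tilde{\boldsymbol{I}} \equiv \boldsymbol{I}(\tilde\theta)$, and the notations $(\tilde{\mathcal{I}}^{-1})_{22}$ and $(\tilde{\boldsymbol{I}}^{-1})_{22}$ signify the
22 blocks of the inverses of $\tilde{\mathcal{I}}$ and $\boldsymbol{I}(\tilde\theta)$, respectively.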


Since the statistic (10.68) is a quadratic form in an $r$-vector, which is asymptotically
normally distributed with mean 0, and the inverse of an $r \times r$ matrix
that consistently estimates the covariance matrix of that vector, it is clear
that the LM statistic is asymptotically distributed as $\chi^2(r)$ under the null.


However, expression (10.68) is notationally awkward. Because $g_1(\tilde\theta) = 0$
by (10.63), we can rewrite it as what appears to be a quadratic form with $k$
rather than $r$ degrees of freedom, as follows,


$$\mathrm{LM} = g^\top(\tilde\theta)\, \tilde{\boldsymbol{I}}^{-1} g(\tilde\theta), \tag{10.69}$$


where the notational awkwardness has disappeared. In addition, since (10.69) 
no longer depends on the partitioning of $\theta$ that we used to express the zero
restrictions, it is applicable quite generally, whether or not the restrictions 
are zero restrictions. This follows from the invariance of the LM test under 
reparametrizations of the model; see Exercise 10.14. 


Expression (10.69) is the statistic associated with the score form of the
LM test, often simply called the score test, since it is defined in terms of the
score vector $g(\theta)$ evaluated at the restricted estimates $\tilde\theta$. It must of course be
kept in mind that, despite the appearance of (10.69), it has only $r$, and not $k$,
degrees of freedom. This “using up” of $k - r$ degrees of freedom is due to the
fact that the $k - r$ elements of $\theta_1$ are estimated. It is entirely analogous to
a similar phenomenon discussed in Sections 9.4 and 9.5, in connection with 
Hansen-Sargan tests. 


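One way to maximize the loglikelihood function $\ell(\theta)$ subject to the restrictions
$\boldsymbol{r}(\theta) = 0$ is simultaneously to maximize the Lagrangian


$$\ell(\theta) - \boldsymbol{r}^\top(\theta)\lambda$$


with respect to $\theta$ and minimize it with respect to the $r$-vector of Lagrange
multipliers $\lambda$. The first-order conditions that characterize the solution to this
problem are the $k + r$ equations


$$g(\tilde\theta) - R^\top(\tilde\theta)\tilde\lambda = 0, \qquad \boldsymbol{r}(\tilde\theta) = 0.$$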


The first set of these equations allows us to rewrite the LM statistic (10.69)
in terms of the Lagrange multipliers $\tilde\lambda$, thereby obtaining the LM form of the
test:

$$\mathrm{LM} = \tilde\lambda^\top \tilde R\, \tilde{\boldsymbol{I}}^{-1} \tilde R^\top \tilde\lambda, \tag{10.70}$$

where $\tilde R \equiv R(\tilde\theta)$. The score form (10.69) is used much more often than the
LM form (10.70), because $g(\tilde\theta)$ is almost always available, no matter how
the restricted estimates are obtained, whereas $\tilde\lambda$ is available only if they are
obtained by using a Lagrangian.


LM Tests and Artificial Regressions 


We have so far assumed that the information matrix estimator used to construct
the LM statistic is $\tilde{\boldsymbol{I}} \equiv \boldsymbol{I}(\tilde\theta)$. Because this estimator is usually more
efficient than other estimators of the information matrix, $\tilde{\boldsymbol{I}}$ is often referred
to as the efficient score estimator of the information matrix. However, there
are as many different ways to compute any given LM statistic as there are
asymptotically valid ways to estimate the information matrix. In practice, $\tilde{\boldsymbol{I}}$
is often replaced by some other estimator, such as minus the empirical Hessian
or the OPG estimator. For example, if the OPG estimator is used in (10.69),
the statistic becomes


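$$\tilde g^\top (\tilde G^\top \tilde G)^{-1} \tilde g, \tag{10.71}$$


where $\tilde g \equiv g(\tilde\theta)$ and $\tilde G \equiv G(\tilde\theta)$. This OPG variant of the statistic is asymptotically,
but not numerically, equivalent to the efficient score variant computed
using $\tilde{\boldsymbol{I}}$. In contrast, the score and LM forms of the test are numerically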
equivalent provided both are computed using the same information matrix 
estimator. 


The statistic (10.71) can readily be computed by use of an artificial regression 
called the OPG regression, which has the general form 


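$$\iota = G(\theta)c + \text{residuals}, \tag{10.72}$$


where $\iota$ is an $n$-vector of 1s. This regression can be constructed for any model
for which the loglikelihood function can be written as the sum of $n$ contributions.
If we evaluate (10.72) at the vector of restricted estimates $\tilde\theta$, it becomes


$$\iota = \tilde G c + \text{residuals}, \tag{10.73}$$

and the explained sum of squares is

$$\iota^\top \tilde G (\tilde G^\top \tilde G)^{-1} \tilde G^\top \iota = \tilde g^\top (\tilde G^\top \tilde G)^{-1} \tilde g,$$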


by (10.27). The right-hand side above is equal to expression (10.71), and so 
the ESS from regression (10.73) is numerically equal to the OPG variant of 
the LM statistic. 


In the case of regression (10.72), the total sum of squares is just $n$, the squared
length of the vector $\iota$. Therefore, ESS $= n -$ SSR. This result gives us a
particularly easy way to calculate the LM test statistic, and it also puts an 
upper bound on it: The OPG variant of the LM statistic can never exceed 
the number of observations in the OPG regression. 
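
As an illustration of the OPG regression, consider a deliberately simple case that is an assumption of this sketch rather than an example from the text: the observations are $N(\mu, \sigma^2)$ and the null hypothesis is $\mu = 0$, so that $k = 2$ and $r = 1$. The function name `opg_lm_zero_mean`, the seed, and the simulated data are hypothetical.

```python
import math
import random


def opg_lm_zero_mean(y):
    """OPG variant of the LM statistic for H0: mu = 0 when y_t ~ N(mu, sigma^2)."""
    n = len(y)
    sigma2 = sum(v * v for v in y) / n          # restricted ML estimate of sigma^2
    sigma = math.sqrt(sigma2)
    # Columns of G: derivatives of each loglikelihood contribution with
    # respect to mu and sigma, evaluated at the restricted estimates (0, sigma).
    g_mu = [v / sigma2 for v in y]
    g_sigma = [-1.0 / sigma + v * v / sigma ** 3 for v in y]
    # Normal equations for regressing a vector of ones on the two columns of G.
    a11 = sum(a * a for a in g_mu)
    a12 = sum(a * b for a, b in zip(g_mu, g_sigma))
    a22 = sum(b * b for b in g_sigma)
    b1, b2 = sum(g_mu), sum(g_sigma)            # G'iota, the gradient vector
    det = a11 * a22 - a12 * a12
    c1 = (a22 * b1 - a12 * b2) / det
    c2 = (a11 * b2 - a12 * b1) / det
    ssr = sum((1.0 - c1 * a - c2 * b) ** 2 for a, b in zip(g_mu, g_sigma))
    ess = n - ssr                               # the total sum of squares is n
    quad_form = b1 * c1 + b2 * c2               # g'(G'G)^{-1} g, as in (10.71)
    return ess, quad_form


rng = random.Random(42)                         # hypothetical seed
y = [0.3 + rng.gauss(0.0, 1.0) for _ in range(40)]
ess, quad_form = opg_lm_zero_mean(y)
print(ess, quad_form)
```

The two printed values agree up to rounding error, and neither can exceed the sample size of 40. In this sketch the normal equations are solved by hand only because there are two columns; any least squares routine would serve for larger $k$.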


Although the OPG form of the LM test is easy to calculate for a very wide va- 
riety of models, it does not have particularly good finite-sample properties. In 
fact, there is a great deal of evidence to suggest that this form of the LM test is 
much more likely to overreject than any other form and that it can overreject 
very severely in some cases. Therefore, unless it is bootstrapped, the OPG 
form of the LM test should be used with great caution. See Davidson and 
MacKinnon (1993, Chapter 13) for references. Fortunately, in many circum- 
stances, other artificial regressions with much better finite-sample properties 
are available; see Davidson and MacKinnon (2001). 


LM Tests and the GNR


Consider again the case of linear restrictions on the parameters of the classical 
normal linear model. By summing the contributions (10.46) to the gradient, 
we see that the gradient of the loglikelihood for this model with respect to $\beta$
can be written as


$$g(\beta, \sigma) = \frac{1}{\sigma^2} X^\top (y - X\beta).$$


Since the information matrix (10.52) is block-diagonal, we need not bother
with the gradient with respect to $\sigma$ in order to compute the LM statistic (10.69).
From (10.49), we know that the $\beta$-$\beta$ block of the information
matrix is $\sigma^{-2} X^\top X$. Thus, if we write the restricted estimates of the parameters
as $\tilde\beta$ and $\tilde\sigma$, the statistic (10.69), computed with the efficient score
estimator of the information matrix, takes the form


$$\frac{1}{\tilde\sigma^2} (y - X\tilde\beta)^\top X (X^\top X)^{-1} X^\top (y - X\tilde\beta). \tag{10.74}$$


This variant of the LM statistic is, like the LR and some variants of the Wald 
statistic, a deterministic, strictly increasing, function of the F statistic (10.57); 
see Exercise 10.17. 


More generally, for a nonlinear regression model subject to possibly nonlinear 
restrictions on the parameters, we see that, by analogy with (10.74), the 
LM statistic can be written as 


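$$\frac{1}{\tilde\sigma^2} (y - \tilde x)^\top \tilde X (\tilde X^\top \tilde X)^{-1} \tilde X^\top (y - \tilde x), \tag{10.75}$$

where $\tilde x \equiv x(\tilde\beta)$ is the $n$-vector of nonlinear regression functions evaluated
at the restricted ML estimates $\tilde\beta$, and $\tilde X \equiv X(\tilde\beta)$ is the $n \times k$ matrix of
derivatives of the regression functions with respect to the components of $\beta$. It
is easy to show that (10.75) is just $n$ times the uncentered $R^2$ from the GNR


$$y - \tilde x = \tilde X b + \text{residuals},$$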


which corresponds to the unrestricted nonlinear regression, evaluated at the 
restricted estimates. As we saw in Section 6.7, this is one of the valid statistics 
that can be computed using a GNR. 
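
The GNR-based calculation can be sketched for a small nonlinear model. Everything specific here is an assumption made for illustration: the regression function $\beta_1 x_t^{\beta_2}$, the null hypothesis $\beta_2 = 1$, the function name `gnr_lm_power_model`, and the simulated data. Under the null the model is linear in $\beta_1$, so the restricted estimate has a closed form.

```python
import math
import random


def gnr_lm_power_model(y, x):
    """LM statistic for H0: beta2 = 1 in y_t = beta1 * x_t**beta2 + u_t, via the GNR."""
    n = len(y)
    # Restricted NLS (and ML) estimate: with beta2 = 1 the model is linear in beta1.
    b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    resid = [yi - b1 * xi for xi, yi in zip(x, y)]
    # Derivatives of the regression function at the restricted estimates.
    z1 = list(x)                                   # derivative w.r.t. beta1
    z2 = [b1 * xi * math.log(xi) for xi in x]      # derivative w.r.t. beta2
    # Regress the restricted residuals on (z1, z2) via the normal equations.
    a11 = sum(a * a for a in z1)
    a12 = sum(a * b for a, b in zip(z1, z2))
    a22 = sum(b * b for b in z2)
    m1 = sum(a * e for a, e in zip(z1, resid))
    m2 = sum(b * e for b, e in zip(z2, resid))
    det = a11 * a22 - a12 * a12
    c1 = (a22 * m1 - a12 * m2) / det
    c2 = (a11 * m2 - a12 * m1) / det
    ess = m1 * c1 + m2 * c2                        # explained sum of squares
    tss = sum(e * e for e in resid)                # uncentered total sum of squares
    return n * ess / tss                           # n times the uncentered R^2


rng = random.Random(3)                             # hypothetical seed
x = [rng.uniform(0.5, 2.0) for _ in range(60)]
y = [2.0 * xi + rng.gauss(0.0, 0.3) for xi in x]   # the null is true here
lm = gnr_lm_power_model(y, x)
print(lm)
```

Because the statistic is $n$ times an uncentered $R^2$, it necessarily lies between 0 and $n$. Under the null it would be compared with the $\chi^2(1)$ distribution.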


Bootstrapping the Classical Tests 


When two or more of the classical test statistics differ substantially in magni- 
tude, or when we have any other reason to believe that asymptotic tests based 
on them may not be reliable, bootstrap tests provide an attractive alterna- 
tive to asymptotic ones. Since maximum likelihood requires a fully specified 
model, it is appropriate to use a parametric bootstrap, rather than resampling. 
Since, for any given parameter vector $\theta$, the likelihood function is the PDF
of the dependent variable, parametric bootstrap samples $y^*$ will simply be
realizations of vector random variables from the distribution characterized by 
that PDF, evaluated at a consistent estimate of the model parameters. This 
estimate must of course satisfy the restrictions to be tested, and so the natural 
choice, and usually the best one, is the vector of restricted ML estimates. 


The procedure we recommend for bootstrapping any of the classical tests is 
very similar to the procedure for bootstrapping F tests that was discussed 
in Section 4.6. The model is estimated under the null to obtain the vector 
of restricted estimates $\tilde\theta$, and the desired test statistic, $\hat\tau$, is computed. This
step may, of course, entail the estimation of the unrestricted model. One then
generates $B$ bootstrap samples using the DGP characterized by $\tilde\theta$. For each
of them, a bootstrap statistic $\tau_j^*$, $j = 1, \ldots, B$, is computed in the same way
as was $\hat\tau$. A bootstrap $P$ value can then be obtained in the usual way as the
proportion of bootstrap statistics more extreme than $\hat\tau$ itself; see (4.61).


We strongly recommend use of the bootstrap whenever there is any reason to 
believe that classical tests based on asymptotic theory may not be reliable, 
unless calculating a moderate number of $\tau_j^*$ is computationally infeasible.
When this calculation is expensive, methods that do not use a fixed value of 
B may be attractive; see Davidson and MacKinnon (2000). 


It is important to note that, as we saw earlier in this section for some tests in 
linear regression models, certain classical test statistics may be deterministic, 
strictly increasing, functions of other statistics. The bootstrap P values will 
be identical for statistics related in this way, since a bootstrap P value depends 
only on the ordering of the statistic $\hat\tau$ and the bootstrap statistics $\tau_j^*$, and this
ordering is invariant under a deterministic, strictly increasing, function. If we 
can readily compute a number of test statistics that are not deterministically 
related, it is desirable to bootstrap all of them at once. This will usually 
be much cheaper than bootstrapping them separately. In general, we would 
expect the bootstrap P values from the various tests to be fairly similar, at 
least if the null hypothesis is true. 
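
The bootstrap procedure, and the fact that deterministically related statistics yield identical bootstrap $P$ values, can be sketched as follows. This is a minimal illustration under stated assumptions: the model is a simple regression with normal errors, the null sets the slope to zero, and the names `f_and_lr` and `bootstrap_p_values`, the seeds, the number of bootstrap samples (399), and the simulated data are all hypothetical choices.

```python
import math
import random


def f_and_lr(y, x):
    """F and LR statistics for H0: slope = 0 in y = b1 + b2*x + u."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ssr_r = sum((yi - ybar) ** 2 for yi in y)
    ssr_u = ssr_r - sxy ** 2 / sxx
    f_stat = (ssr_r - ssr_u) / (ssr_u / (n - 2))
    lr_stat = n * math.log(ssr_r / ssr_u)
    return f_stat, lr_stat


def bootstrap_p_values(y, x, num_boot=399, seed=2024):
    """Parametric bootstrap P values for the F and LR statistics."""
    rng = random.Random(seed)
    n = len(y)
    f_hat, lr_hat = f_and_lr(y, x)
    # Restricted ML estimates define the bootstrap DGP.
    mu_tilde = sum(y) / n
    sigma_tilde = math.sqrt(sum((yi - mu_tilde) ** 2 for yi in y) / n)
    exceed_f = exceed_lr = 0
    for _ in range(num_boot):
        y_star = [mu_tilde + sigma_tilde * rng.gauss(0.0, 1.0) for _ in range(n)]
        f_star, lr_star = f_and_lr(y_star, x)
        exceed_f += f_star > f_hat
        exceed_lr += lr_star > lr_hat
    return exceed_f / num_boot, exceed_lr / num_boot


data_rng = random.Random(11)                       # hypothetical seed
x = [data_rng.gauss(0.0, 1.0) for _ in range(30)]
y = [1.0 + 0.4 * xi + data_rng.gauss(0.0, 1.0) for xi in x]
p_f, p_lr = bootstrap_p_values(y, x)
print(p_f, p_lr)
```

Both statistics are computed from the same bootstrap samples, which is the sense in which it is cheaper to bootstrap several statistics at once. The two $P$ values coincide because LR is a strictly increasing function of F, so the ordering of $\hat\tau$ among the $\tau_j^*$ is the same for both.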


10.6 The Asymptotic Theory of the Three Classical Tests 


In this fairly advanced section, we show that the three classical test statistics 
tend asymptotically to the same random variable. This is true both under 
the null hypothesis and under alternatives that are close to the null in a sense 
to be made precise later. The proof, which is limited to the former case, 
involves obtaining expressions for the probability limits of all three statistics 
in terms of the asymptotic information matrix $\mathcal{I} \equiv \mathcal{I}(\theta_0)$ and the asymptotic
score vector $s \equiv \operatorname{plim} n^{-1/2} g(\theta_0)$. To avoid cluttering the notation, we omit
zero subscripts. The results will be developed explicitly only for restrictions
of the form $\theta_2 = 0$, but they apply quite generally.
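
By a second-order Taylor expansion of $\ell(\tilde\theta)$ around $\hat\theta$, we obtain

$$\ell(\tilde\theta) = \ell(\hat\theta) + \tfrac{1}{2} (\tilde\theta - \hat\theta)^\top H(\bar\theta) (\tilde\theta - \hat\theta),$$


where $\bar\theta$ is defined as usual in such an expansion. The first-order term vanishes
because of the likelihood equations $g(\hat\theta) = 0$. It follows that


$$\mathrm{LR} = 2\bigl(\ell(\hat\theta) - \ell(\tilde\theta)\bigr) = -(\tilde\theta - \hat\theta)^\top H(\bar\theta) (\tilde\theta - \hat\theta).$$


The information matrix equality and the consistency of $\hat\theta$, which implies the
consistency of $\bar\theta$, then yield the result that


$$\mathrm{LR} \stackrel{a}{=} n (\hat\theta - \tilde\theta)^\top \mathcal{I} (\hat\theta - \tilde\theta). \tag{10.76}$$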


When we take the limit of (10.76), we can use the asymptotic equalities 
(10.38) and (10.64) to eliminate the estimators that appear in (10.76), re- 
placing them by expressions that involve only the asymptotic information 
matrix and asymptotic score vector, as follows: 


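$$\operatorname{plim} n^{1/2} (\hat\theta - \tilde\theta) = \operatorname{plim} n^{1/2} (\hat\theta - \theta_0) - \operatorname{plim} n^{1/2} (\tilde\theta - \theta_0) = \mathcal{I}^{-1} s - \begin{bmatrix} \mathcal{I}_{11}^{-1} s_1 \\ 0 \end{bmatrix}. \tag{10.77}$$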


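Here $\mathcal{I}_{11}$ and $s_1$ denote, respectively, the $(k - r) \times (k - r)$ block of $\mathcal{I}$ and the
subvector of $s$ that correspond to $\theta_1$. We rewrite the last expression in (10.77)
as $\mathcal{J}s$, where the $k \times k$ symmetric matrix $\mathcal{J}$ is defined as


$$\mathcal{J} \equiv \mathcal{I}^{-1} - \begin{bmatrix} \mathcal{I}_{11}^{-1} & O \\ O & O \end{bmatrix}. \tag{10.78}$$

Using (10.78), the probability limit of (10.76) is seen to be

$$\operatorname{plim} \mathrm{LR} = s^\top \mathcal{J} \mathcal{I} \mathcal{J} s. \tag{10.79}$$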


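Moreover, from (10.78), we have that

$$\mathcal{I}\mathcal{J} = \mathbf{I}_k - \begin{bmatrix} \mathcal{I}_{11} \\ \mathcal{I}_{21} \end{bmatrix} \begin{bmatrix} \mathcal{I}_{11}^{-1} & O \end{bmatrix} = \begin{bmatrix} O & O \\ -\mathcal{I}_{21} \mathcal{I}_{11}^{-1} & \mathbf{I}_r \end{bmatrix}, \tag{10.80}$$


where the suffixes on the two identity matrices above indicate their dimensions.
If we denote the last $k \times k$ matrix in (10.80) by $Q$, (10.80) can be
written simply as $\mathcal{I}\mathcal{J} = Q$. This in turn implies that $\mathcal{I}^{-1} Q = \mathcal{J}$, and, since


$$\begin{bmatrix} \mathcal{I}_{11}^{-1} & O \\ O & O \end{bmatrix} \begin{bmatrix} O & O \\ -\mathcal{I}_{21} \mathcal{I}_{11}^{-1} & \mathbf{I}_r \end{bmatrix} = O,$$

it follows from (10.78) that $\mathcal{J} Q = \mathcal{J}$. This implies that $\mathcal{J}\mathcal{I}\mathcal{J} = \mathcal{J}$, from which
we conclude that (10.79) can be written as

$$\operatorname{plim} \mathrm{LR} = s^\top \mathcal{J} s. \tag{10.81}$$

This expression, together with the definition (10.78) of the matrix $\mathcal{J}$, shows
clearly how $k - r$ of the $k$ degrees of freedom of $s^\top \mathcal{I}^{-1} s$ are used up by the
process of estimating $\theta_1$ under the null hypothesis.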

We now go through a similar exercise for the LM statistic, all variants of
which are asymptotically equal to the statistic in (10.69). Consider the last
line of (10.66). If we stack the restricted likelihood equations, $g_1(\tilde\theta) = 0$, on
top of this, and use the definitions of $Q$ and $s$, we find that (10.66) can be
written as

$$\operatorname*{plim}_{n\to\infty} n^{-1/2} g(\tilde\theta) = Q s.$$


We then see from (10.69) that

$$\operatorname*{plim}_{n\to\infty} \mathrm{LM} = s^\top Q^\top \mathcal{I}^{-1} Q s = s^\top \mathcal{J} \mathcal{I} \mathcal{J} s = s^\top \mathcal{J} s, \tag{10.82}$$


since $\mathcal{I}^{-1} Q = \mathcal{J}$ and $\mathcal{J}\mathcal{I}\mathcal{J} = \mathcal{J}$ by our earlier results. The asymptotic equivalence
of the LR and LM statistics follows from (10.81) and (10.82).


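The Wald statistic (10.60), for the case of zero restrictions, can be written as

$$W = \hat\theta_2^\top \bigl( (\hat{\boldsymbol{I}}^{-1})_{22} \bigr)^{-1} \hat\theta_2,$$

and the limit of the statistic can therefore be expressed as


$$\operatorname*{plim}_{n\to\infty} W = \operatorname*{plim}_{n\to\infty} n^{1/2} \hat\theta_2^\top \bigl( (\mathcal{I}^{-1})_{22} \bigr)^{-1} \operatorname*{plim}_{n\to\infty} n^{1/2} \hat\theta_2. \tag{10.83}$$


When we were developing the LM statistic in the previous section, we saw that
the inverse of the 22 block of $\mathcal{I}^{-1}$ was equal to the last expression in (10.67).
From the middle expression in (10.67), we then obtain


$$\begin{bmatrix} O & O \\ O & \bigl( (\mathcal{I}^{-1})_{22} \bigr)^{-1} \end{bmatrix}
= \begin{bmatrix} O & O \\ -\mathcal{I}_{21} \mathcal{I}_{11}^{-1} & \mathbf{I}_r \end{bmatrix}
\begin{bmatrix} \mathcal{I}_{11} & \mathcal{I}_{12} \\ \mathcal{I}_{21} & \mathcal{I}_{22} \end{bmatrix}
\begin{bmatrix} O & -\mathcal{I}_{11}^{-1} \mathcal{I}_{12} \\ O & \mathbf{I}_r \end{bmatrix}
= Q \mathcal{I} Q^\top.$$

Thus (10.83) becomes, by use of (10.38),

$$\begin{aligned}
\operatorname{plim} W &= \operatorname{plim} n^{1/2} (\hat\theta - \theta_0)^\top \, Q \mathcal{I} Q^\top \operatorname{plim} n^{1/2} (\hat\theta - \theta_0) \\
&= s^\top \mathcal{I}^{-1} Q \mathcal{I} Q^\top \mathcal{I}^{-1} s = s^\top \mathcal{J} \mathcal{I} \mathcal{J} s = s^\top \mathcal{J} s,
\end{aligned}$$


where we have made use of the relations among $\mathcal{I}$, $\mathcal{J}$, and $Q$ that have previously
been established. This result shows that all three classical test statistics
tend to the same limiting random variable, namely, $s^\top \mathcal{J} s$.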
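
The matrix relations used in this argument can be verified exactly on a small example. The sketch below assumes a hypothetical $2 \times 2$ asymptotic information matrix (so $k = 2$ and $r = 1$); the helper names `matmul` and `inv2` are illustrative, and exact rational arithmetic is used so that the identities hold without rounding error.

```python
from fractions import Fraction as Fr


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


# Hypothetical asymptotic information matrix with k = 2 and r = 1.
info = [[Fr(2), Fr(1)], [Fr(1), Fr(2)]]
info_inv = inv2(info)

# Script J from (10.78): inverse information matrix minus the padded inverse
# of the top left block (here the scalar 1 / info[0][0]).
script_j = [
    [info_inv[0][0] - 1 / info[0][0], info_inv[0][1]],
    [info_inv[1][0], info_inv[1][1]],
]

q = matmul(info, script_j)                       # Q from (10.80)
j_q = matmul(script_j, q)                        # should equal script J
j_i_j = matmul(matmul(script_j, info), script_j)  # should equal script J

print(q)
print(script_j)
```

With this choice, $Q$ has a first row of zeros and second row $[-\tfrac{1}{2} \;\; 1]$, matching the block form in (10.80), and both $\mathcal{J}Q$ and $\mathcal{J}\mathcal{I}\mathcal{J}$ reproduce $\mathcal{J}$.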


The Three Classical Tests when the Null is False


The asymptotic equivalence result that we have just proved depends on the 
assumption that the DGP belongs to the null hypothesis. However, the three 
classical tests will yield asymptotically equivalent inferences only if the equiv- 
alence holds generally, and not just under the null hypothesis. 


A test is said to be consistent against a DGP that does not belong to the 
null hypothesis if, under that DGP, the power of the test tends to 1 as the 
sample size tends to infinity. We saw in Section 4.7 that, if the null and 
alternative hypotheses are classical normal linear models, power is determined 
by a noncentrality parameter that must tend to infinity for power to tend to 1. 
The three classical tests have a property similar to that of the exact tests of the 
classical normal linear model: Under DGPs in the alternative but not in the 
null, the classical test statistics tend to random variables that are distributed 
as noncentral chi-squared with r degrees of freedom, where the noncentrality 
parameters tend to infinity with the sample size. 


If all three classical tests can be shown to be consistent against a given DGP, 
then they are asymptotically equivalent under this DGP in the sense that, 
as $n \to \infty$, power tends to 1. But this does not rule out the possibility
that, in finite samples, one of the tests may be much more powerful than the
others. In order to investigate such a possibility, we want to develop a version
of asymptotic theory in which the powers of different tests tend to different
limits as $n \to \infty$ if they have very different powers in finite samples.


The simplest case we can study is that of the $t$ statistic for the restriction
$\beta_2 = 0$ in the linear regression model


$$y = X_1 \beta_1 + \beta_2 x_2 + u.$$


The noncentrality parameter $\lambda$ of the $t$ statistic, in finite samples, is given as
a function of $\beta_2$ and the error variance $\sigma^2$ in equation (4.72), which we repeat
here for convenience:


$$\lambda = \frac{1}{\sigma} (x_2^\top M_1 x_2)^{1/2} \beta_2.$$


For fixed $\beta_2$ and $\sigma$, $\lambda$ tends to infinity as $n \to \infty$, since, under the regularity
conditions for the classical normal linear model, $n^{-1} x_2^\top M_1 x_2$ tends to a finite
limit, which we denote by $S_{x_2^\top M_1 x_2}$. It follows that $n^{-1/2}\lambda$ tends to a finite
limit, rather than $\lambda$ itself. But if, instead of keeping $\beta_2$ fixed, we subject it
to what is called a Pitman drift, we can obtain a different result. Let $\delta$ be a
fixed parameter, and, for each sample size $n$, let $\beta_2 = n^{-1/2}\delta$. We then find
that


$$\lambda = \frac{1}{\sigma}\, n^{-1/2} (x_2^\top M_1 x_2)^{1/2} \delta = \frac{1}{\sigma} (n^{-1} x_2^\top M_1 x_2)^{1/2} \delta \;\to\; \frac{\delta}{\sigma} \bigl( S_{x_2^\top M_1 x_2} \bigr)^{1/2}.$$


Since the limit of A is no longer infinite, we can compare the possibly different 
limits obtained for different test statistics. A DGP for which the parameters 
depend explicitly on the sample size is called a drifting DGP. 
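
The effect of a Pitman drift on the noncentrality parameter can be illustrated numerically. The sketch below makes several assumptions that are not in the text: $X_1$ is just a constant, so that $x_2^\top M_1 x_2$ is the sum of squared deviations of $x_2$ from its mean; $x_2$ is drawn from a standard normal distribution, so that the limit $S_{x_2^\top M_1 x_2}$ equals 1; and the values of $\delta$ and $\sigma$, the seed, and the function name `noncentrality` are hypothetical.

```python
import math
import random


def noncentrality(x2, beta2, sigma):
    """lambda = (x2' M1 x2)^{1/2} * beta2 / sigma when X1 is a constant."""
    n = len(x2)
    xbar = sum(x2) / n
    x2_m1_x2 = sum((v - xbar) ** 2 for v in x2)
    return math.sqrt(x2_m1_x2) * beta2 / sigma


rng = random.Random(7)                 # hypothetical seed
delta, sigma = 1.5, 2.0                # hypothetical parameter values
sample_sizes = [100, 1000, 10000, 100000]
lam_fixed, lam_drift = [], []
for n in sample_sizes:
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Fixed alternative: beta2 stays at delta, so lambda grows like sqrt(n).
    lam_fixed.append(noncentrality(x2, delta, sigma))
    # Drifting DGP: beta2 = delta / sqrt(n), so lambda settles near delta / sigma.
    lam_drift.append(noncentrality(x2, delta / math.sqrt(n), sigma))
    print(n, round(lam_fixed[-1], 3), round(lam_drift[-1], 3))
```

Under the fixed alternative the noncentrality parameter keeps growing with the sample size, while under the drifting DGP it stays close to $\delta/\sigma = 0.75$, the limit implied by the formula above for this design.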

If the model that corresponds to the alternative hypothesis is characterized
by the loglikelihood function $\ell(\theta_1, \theta_2)$, and the null hypothesis is the set of $r$
zero restrictions $\theta_2 = 0$, an appropriate drifting DGP for studying power is
one for which $\theta_1$ is fixed and $\theta_2$ is given by $n^{-1/2}\delta$ for a fixed $r$-vector $\delta$. It
can then be shown that, under this drifting DGP, just as under the null, the
LR, LM, and Wald statistics tend as $n \to \infty$ to the same random variable,
which follows a noncentral $\chi^2(r)$ distribution; see Exercise 10.19 for a very
simple example. More generally, as discussed by Davidson and MacKinnon 
(1987), we can allow for drifting DGPs that do not lie within the alternative 
hypothesis, but that drift toward some fixed DGP in the null hypothesis. 
It then turns out that, for drifting DGPs that are, in an appropriate sense, 
equally distant from the null, the noncentrality parameter is maximized by 
those DGPs that do lie within the alternative hypothesis. This result justifies 
the intuition that, for a given number of degrees of freedom, tests against an 
alternative which happens to be true will have more power than tests against 
other alternatives. 


10.7 ML Estimation of Models with Autoregressive Errors 


In Section 7.8, we discussed several methods based on generalized or nonlinear 
least squares for estimating linear regression models with error terms that 
follow an autoregressive process. An alternative approach is to use maximum 
likelihood. If it is assumed that the innovations are normally distributed, 
ML estimation is quite straightforward. With the normality assumption, the 
model (7.40) considered in Sections 7.7 and 7.8 can be written as 


$$y_t = X_t \beta + u_t, \quad u_t = \rho u_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{NID}(0, \sigma_\varepsilon^2), \tag{10.84}$$


in which the error terms follow an AR(1) process with parameter $\rho$ that is
assumed to be less than 1 in absolute value. If we omit the first observation,
this model can be rewritten as in equation (7.41). The result is just a nonlinear
regression model, and so, as we saw in Section 10.2, the ML estimates of $\beta$
and $\rho$ must coincide with the NLS ones.


Maximum likelihood estimation of (10.84) is more interesting if we do not omit 
the first observation, because, in that case, the ML estimates no longer coin- 
cide with either the NLS or the GLS estimates. For observations 2 through n, 
the contributions to the loglikelihood can be written as in (10.09): 


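$$\ell_t(y^t, \beta, \rho, \sigma_\varepsilon) = -\tfrac{1}{2} \log 2\pi - \log \sigma_\varepsilon - \frac{1}{2\sigma_\varepsilon^2} (y_t - \rho y_{t-1} - X_t \beta + \rho X_{t-1} \beta)^2. \tag{10.85}$$


As required by (10.24), this expression is the log of the density of $y_t$ conditional
on the lagged dependent variable $y_{t-1}$.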

For the first observation, the only information we have is that

$$y_1 = X_1 \beta + u_1,$$


since the lagged dependent variable $y_0$ is not observed. However, with the
normality assumption, we know from Section 7.8 that $u_1 \sim N\bigl(0, \sigma_\varepsilon^2/(1 - \rho^2)\bigr)$.
Thus the loglikelihood contribution from the first observation is the log of the
density of that distribution, namely,


$$\ell_1(y_1, \beta, \rho, \sigma_\varepsilon) = -\tfrac{1}{2} \log 2\pi - \log \sigma_\varepsilon + \tfrac{1}{2} \log(1 - \rho^2) - \frac{1 - \rho^2}{2\sigma_\varepsilon^2} (y_1 - X_1 \beta)^2. \tag{10.86}$$


The loglikelihood function for the model (10.84) based on the entire sample 
is obtained by adding the contribution (10.86) to the sum of the contribu- 
tions (10.85), for t = 2,...,n. The result is 


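$$\begin{aligned}
\ell(y, \beta, \rho, \sigma_\varepsilon) = {} & -\frac{n}{2} \log 2\pi - n \log \sigma_\varepsilon + \tfrac{1}{2} \log(1 - \rho^2) \\
& - \frac{1}{2\sigma_\varepsilon^2} \Bigl( (1 - \rho^2)(y_1 - X_1 \beta)^2 + \sum_{t=2}^{n} (y_t - \rho y_{t-1} - X_t \beta + \rho X_{t-1} \beta)^2 \Bigr).
\end{aligned} \tag{10.87}$$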


The term $\tfrac{1}{2}\log(1 - \rho^2)$ that appears in (10.87) plays an extremely important
role in ML estimation. Because it tends to minus infinity as $\rho$ tends to $\pm 1$,
its presence in the loglikelihood function ensures that there must be a maximum
within the stationarity region defined by $|\rho| < 1$. Therefore, maximum
likelihood estimation using the full sample is guaranteed to yield an estimate
of $\rho$ for which the AR(1) process is stationary. This is not the case for any of
the estimation techniques discussed in Section 7.8. 
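
The full-sample loglikelihood (10.87) is simple to code directly. The following sketch is an illustration only: the function name `ar1_loglik`, the representation of the regressors as a list of rows, the convention of returning minus infinity outside the stationarity region, and the small data set are all assumptions.

```python
import math


def ar1_loglik(y, X, beta, rho, sigma_eps):
    """Loglikelihood (10.87) for a linear regression with AR(1) normal errors.

    y is a list of n observations, X a list of n rows of regressors, and
    beta a list of coefficients. Returns -inf outside |rho| < 1.
    """
    if abs(rho) >= 1.0 or sigma_eps <= 0.0:
        return -math.inf
    n = len(y)
    # u_t(beta) = y_t - X_t beta
    u = [yt - sum(xj * bj for xj, bj in zip(row, beta)) for yt, row in zip(y, X)]
    # y_t - rho*y_{t-1} - X_t beta + rho*X_{t-1} beta equals u_t - rho*u_{t-1}.
    sum_sq = (1.0 - rho ** 2) * u[0] ** 2 + sum(
        (u[t] - rho * u[t - 1]) ** 2 for t in range(1, n)
    )
    return (
        -0.5 * n * math.log(2.0 * math.pi)
        - n * math.log(sigma_eps)
        + 0.5 * math.log(1.0 - rho ** 2)
        - sum_sq / (2.0 * sigma_eps ** 2)
    )


# Hypothetical data: a constant is the only regressor.
y = [0.3, -1.2, 0.8, 0.1, -0.5]
X = [[1.0] for _ in y]
for rho in (0.0, 0.5, 0.9, 0.999, 0.999999):
    print(rho, ar1_loglik(y, X, [0.0], rho, 1.0))
```

As $\rho$ approaches 1 the term $\tfrac{1}{2}\log(1 - \rho^2)$ pulls the printed values down without bound, although only slowly, which is the mechanism that keeps the maximum inside the stationarity region. A numerical optimizer applied to this function would therefore need no explicit constraint on $\rho$ beyond the guard at the top.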


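Let us define $u_t(\beta)$ as $y_t - X_t\beta$ for $t = 1, \ldots, n$, and let $\hat u_t = u_t(\hat\beta)$. Then,
from the first-order conditions for the maximization of (10.87), it can be seen
that the ML estimators $\hat\beta$, $\hat\rho$, and $\hat\sigma_\varepsilon^2$ satisfy the following equations:


$$\begin{aligned}
& (1 - \hat\rho^2) X_1^\top \hat u_1 + \sum_{t=2}^{n} (X_t - \hat\rho X_{t-1})^\top (\hat u_t - \hat\rho \hat u_{t-1}) = 0, \\
& \hat\rho \hat u_1^2 - \frac{\hat\rho \hat\sigma_\varepsilon^2}{1 - \hat\rho^2} + \sum_{t=2}^{n} \hat u_{t-1} (\hat u_t - \hat\rho \hat u_{t-1}) = 0, \text{ and} \\
& \hat\sigma_\varepsilon^2 = \frac{1}{n} \Bigl( (1 - \hat\rho^2) \hat u_1^2 + \sum_{t=2}^{n} (\hat u_t - \hat\rho \hat u_{t-1})^2 \Bigr).
\end{aligned} \tag{10.88}$$
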
The first two of these equations are similar, but not identical, to the estimating 
equations (7.70) developed in Section 7.8 for iterated feasible GLS or NLS 
with account taken of the first observation. In Exercise 10.21, an artificial 
regression is developed which makes it quite easy to solve equations (10.88). 


10.8 Transformations of the Dependent Variable


Whenever we specify a regression model, one of the choices we implicitly 
have to make is whether, and how, to transform the dependent variable. For 
example, if $y_t$, a typical observation on the dependent variable, is always
positive, it would be perfectly valid to use $\log y_t$, or $y_t^{1/2}$, or one of many
other monotonically increasing nonlinear transformations, instead of $y_t$ itself
as the regressand.


For concreteness, let us suppose that there are just two alternative models, 
which we will refer to as Model 1 and Model 2: 


$$y_t = X_{t1} \beta_1 + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma_1^2), \text{ and}$$
$$\log y_t = X_{t2} \beta_2 + v_t, \quad v_t \sim \mathrm{NID}(0, \sigma_2^2).$$


Precisely how the regressors of the two competing models are related need not 
concern us here. In many cases, some of the regressors for one model will be 
transformations of some of the regressors for the other model. For example, 
$X_{t1}$ might consist of a constant and $z_t$, and $X_{t2}$ might consist of a constant
and $\log z_t$. Model 2 is often called a loglinear regression model.


Although we may be able to specify plausible-looking regression models for 
a number of different transformations of the dependent variable, using any 
model except the correct one will, in general, imply that the error terms are 
neither normally nor identically distributed. For example, suppose that we 
estimate Model 1 when the data were actually generated by Model 2 with 
parameters $\beta_{20}$ and $\sigma_{20}^2$. It follows that


$$\begin{aligned}
y_t &= \exp(X_{t2} \beta_{20} + v_t) \\
&= \exp(X_{t2} \beta_{20}) \exp(v_t) \\
&= \exp(X_{t2} \beta_{20}) \exp(\tfrac{1}{2}\sigma_{20}^2) + \exp(X_{t2} \beta_{20}) \bigl( \exp(v_t) - \exp(\tfrac{1}{2}\sigma_{20}^2) \bigr).
\end{aligned} \tag{10.89}$$


The last line here uses the fact that $\exp(v_t)$ is a lognormal variable, of which
the expectation is $\exp(\sigma_{20}^2/2)$; recall Exercise 9.19. Thus the first term in the
last line is the conditional mean of $y_t$, and so the second term, which is $y_t$
minus this conditional mean, is the error term for Model 1.


Even if it should turn out that $X_{t1}\beta_1$, the regression function for Model 1,
can provide a reasonably good approximation to the conditional mean in the 
last line of (10.89), the error terms for that model cannot possibly have the 
properties we generally assume them to have. If the error terms in Model 2 
are normally and identically distributed, then the error terms in Model 1 
must be skewed to the right and heteroskedastic. Their skewness is a con- 
sequence of the fact that lognormal variables are always skewed to the right 
(see Exercise 10.20). Because their variance is proportional to the square of 
$\exp(X_{t2}\beta_{20})$, they are heteroskedastic.
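
The two properties just described can be checked with a small simulation. The sketch below is only illustrative: the parameter values, the two values of the regressor, and the helper names are assumptions, not part of the text. It draws the Model 1 error term from the last line of (10.89) at two values of a single regressor and reports its skewness and the ratio of its variances.

```python
import math
import random

def model1_errors(x, beta1=1.0, beta2=0.5, sigma=0.5, reps=20000, seed=42):
    """Draw u_t = y_t - E(y_t | x) when log y_t = beta1 + beta2*x + v_t (assumed DGP)."""
    rng = random.Random(seed)
    cond_mean = math.exp(beta1 + beta2 * x) * math.exp(0.5 * sigma**2)
    return [math.exp(beta1 + beta2 * x + rng.gauss(0.0, sigma)) - cond_mean
            for _ in range(reps)]

def variance(u):
    m = sum(u) / len(u)
    return sum((v - m) ** 2 for v in u) / len(u)

def skewness(u):
    m = sum(u) / len(u)
    s = math.sqrt(variance(u))
    return sum(((v - m) / s) ** 3 for v in u) / len(u)

u_low = model1_errors(x=0.0)
u_high = model1_errors(x=2.0)   # same seed, so the same v_t draws are reused

print("skewness at x = 0:", skewness(u_low))
print("skewness at x = 2:", skewness(u_high))
# The variance is proportional to exp(X beta)^2, so this ratio is exp(2 * beta2 * 2).
print("variance ratio:", variance(u_high) / variance(u_low))
```

Because both calls reuse the same normal draws, the variance ratio equals the squared ratio of the two values of $\exp(X_{t2}\beta_{20})$ up to rounding, which is the heteroskedasticity described above.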
As this example demonstrates, even when the errors in the DGP are normally,
identically, and independently distributed, using the wrong transformation of 
the dependent variable as the regressand will, in general, yield a regression 
with error terms that are neither homoskedastic nor symmetric. Thus, when 
we encounter heteroskedasticity and skewness in the residuals of a regression, 
one possible way to eliminate them is to estimate a different regression model 
in which the dependent variable has been subjected to some sort of nonlinear 
transformation. 


Comparing Alternative Models 


It is perfectly easy to subject the dependent variable to various nonlinear 
transformations and estimate one or more regression models for each of them. 
However, least squares estimation does not provide any way to compare the 
fits of competing models that involve different transformations. But max- 
imum likelihood estimation under the assumption that the error terms are 
normally distributed does provide a straightforward way to do so. The idea is 
to compare the loglikelihoods of the alternative models considered as models 
for the same dependent variable. 


For Model 1, in which $y_t$ is the regressand, the concentrated loglikelihood
function is simply

$$-\frac{n}{2}\log 2\pi - \frac{n}{2}
  - \frac{n}{2}\log\Bigl(\frac1n\sum_{t=1}^n (y_t - X_{t1}\beta_1)^2\Bigr). \tag{10.90}$$


Expression (10.90) is just expression (10.11) specialized to Model 1. Most 
regression packages will report the value of (10.90) evaluated at the OLS 
estimates as the maximized value of the loglikelihood function. 


In order to construct the loglikelihood function for the loglinear Model 2, 
interpreted as a model for $y_t$ rather than for $\log y_t$, we need the density of $y_t$
as a function of the model parameters. This requires us to use a standard
result about transformations of variables. Suppose that we wish to know the
CDF of a random variable $X$, but that what we actually know is the CDF of
a random variable $Z$ defined as $Z = h(X)$, where $h(\cdot)$ is a strictly increasing
deterministic function. Denote this known CDF by $F_Z$. Then we can obtain
the CDF $F_X$ of $X$ as follows:

$$F_X(x) = \Pr(X \le x) = \Pr\bigl(h(X) \le h(x)\bigr)
         = \Pr\bigl(Z \le h(x)\bigr) = F_Z\bigl(h(x)\bigr). \tag{10.91}$$


The second equality above follows because $h(\cdot)$ is strictly increasing. The
relation between the densities, or PDFs, of the variables $X$ and $Z$ is obtained
by differentiating the leftmost and rightmost quantities in (10.91) with respect
to $x$. Denoting the PDFs by $f_X(\cdot)$ and $f_Z(\cdot)$, we obtain

$$f_X(x) = F_X'(x) = F_Z'\bigl(h(x)\bigr)h'(x) = f_Z\bigl(h(x)\bigr)h'(x).$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 




If h is strictly decreasing, the above result must be modified so as to use the 
absolute value of the derivative. As readers are asked to show in Exercise 
10.23, the result then becomes 


$$f_X(x) = f_Z\bigl(h(x)\bigr)\,\bigl|h'(x)\bigr|. \tag{10.92}$$


It is not difficult to see that (10.92) is a perfectly general result which holds 
for any strictly monotonic function h. 
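
Result (10.92) is easy to check numerically. The following sketch assumes, purely for illustration, that $Z$ is standard normal and uses $h(x) = \log x$ (increasing) and $h(x) = -\log x$ (decreasing); the function names are hypothetical. It compares the density given by (10.92) with a finite-difference derivative of the CDF of $X$.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def density_of_x(f_z, h, h_prime, x):
    """Density of X implied by (10.92): f_Z(h(x)) * |h'(x)|."""
    return f_z(h(x)) * abs(h_prime(x))

def numeric_derivative(F, x, eps=1e-5):
    """Central finite difference, used only as an independent check."""
    return (F(x + eps) - F(x - eps)) / (2.0 * eps)

x0 = 1.7

# Increasing case: Z = log X, so F_X(x) = F_Z(log x).
inc = density_of_x(norm_pdf, math.log, lambda x: 1.0 / x, x0)
inc_check = numeric_derivative(lambda x: norm_cdf(math.log(x)), x0)

# Decreasing case: Z = -log X, so F_X(x) = Pr(Z >= h(x)) = 1 - F_Z(-log x).
dec = density_of_x(norm_pdf, lambda x: -math.log(x), lambda x: -1.0 / x, x0)
dec_check = numeric_derivative(lambda x: 1.0 - norm_cdf(-math.log(x)), x0)

print(inc, inc_check)
print(dec, dec_check)
```

In the decreasing case the derivative $h'(x)$ is negative, and it is the absolute value that keeps the density positive.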


The factor by which $f_Z(z)$ is multiplied in order to produce $f_X(x)$ is the
absolute value of what is called the Jacobian of the transformation. For Model 2,
$X$ is replaced by $y_t$, and the transformation $h$ is the logarithm, so that $Z$
becomes $\log y_t$. The density of $y_t$ is then given by (10.92) in terms of that of
$\log y_t$:

$$f(y_t) = f(\log y_t)\,\Bigl|\frac{d\log y_t}{dy_t}\Bigr| = \frac{1}{y_t}\,f(\log y_t), \tag{10.93}$$

where we drop subscripts and denote the PDFs of $y_t$ and $\log y_t$ by $f(y_t)$ and
$f(\log y_t)$, respectively.

We can now compute the loglikelihood for Model 2 thought of as a model for
the $y_t$. The concentrated loglikelihood for the $\log y_t$ is given by (10.11):

$$-\frac{n}{2}\log 2\pi - \frac{n}{2}
  - \frac{n}{2}\log\Bigl(\frac1n\sum_{t=1}^n (\log y_t - X_{t2}\beta_2)^2\Bigr). \tag{10.94}$$

This expression is the log of the product of the densities of the $\log y_t$. Since
the density of $y_t$, by (10.93), is equal to $1/y_t$ times the density of $\log y_t$, the
loglikelihood function we are seeking is

$$-\frac{n}{2}\log 2\pi - \frac{n}{2}
  - \frac{n}{2}\log\Bigl(\frac1n\sum_{t=1}^n (\log y_t - X_{t2}\beta_2)^2\Bigr)
  - \sum_{t=1}^n \log y_t. \tag{10.95}$$


The last term here is a Jacobian term. It is the sum over all $t$ of the logarithm
of the Jacobian factor $1/y_t$ in the density of $y_t$. This Jacobian term is
absolutely critical. If it were omitted, Model 2 would be a model for $\log y_t$, and it
would make no sense to compare the value of the loglikelihood for (10.94) with
the value for Model 1, which is a model for $y_t$. But when the Jacobian term is
included, the loglikelihoods for both models are expressed in terms of $y_t$, and
it is perfectly valid to compare their values. We can say with confidence that
the model corresponding to whichever of (10.90) and (10.95) has the larger
value is the model that better fits the data.


Most regression packages will evaluate (10.94) at the OLS estimates for the 
loglinear model and report that as the maximized value of the loglikelihood. 
In order to compute the loglikelihood (10.95), which is what we need if we are 
to compare the fits of the linear and loglinear models, we will have to add the 
Jacobian term to the value reported by the package. 
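
As a rough illustration of this adjustment, the sketch below simulates data from an assumed loglinear DGP with a single regressor, computes the concentrated loglikelihoods that a package would report for the linear and loglinear regressions, and then adds the Jacobian term to the latter. The DGP, the sample size, and the helper names are assumptions made for the example.

```python
import math
import random

def ols_ssr(y, x):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def concentrated_loglik(ssr, n):
    """-(n/2) log 2pi - n/2 - (n/2) log(SSR/n), as in (10.90) and (10.94)."""
    return -0.5 * n * (math.log(2.0 * math.pi) + 1.0 + math.log(ssr / n))

rng = random.Random(0)
n = 200
x = [rng.uniform(0.0, 4.0) for _ in range(n)]
y = [math.exp(1.0 + 0.5 * xi + rng.gauss(0.0, 0.5)) for xi in x]  # assumed DGP
log_y = [math.log(v) for v in y]

ll_linear = concentrated_loglik(ols_ssr(y, x), n)            # (10.90)
ll_log_reported = concentrated_loglik(ols_ssr(log_y, x), n)  # (10.94)
jacobian_term = -sum(log_y)
ll_loglinear = ll_log_reported + jacobian_term               # (10.95)

print("linear model:               ", ll_linear)
print("loglinear, as reported:     ", ll_log_reported)
print("loglinear, as a model for y:", ll_loglinear)
```

Only the first and third numbers are comparable. The second one is a loglikelihood for $\log y_t$ and can be far larger or smaller than the other two for reasons that have nothing to do with fit.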
Of course, the logarithmic transformation is by no means the only one that
we might employ in practice. For example, when the $y_t$ are sharply skewed
to the right, a transformation like $\sqrt{y_t}$ might make sense; see Exercise 10.28.


Weighted least squares also involves transforming the dependent variable. If 
we believe that the error variance is proportional to $w_t^2$, the use of feasible
GLS leads us to divide $y_t$ and all the regressors by $w_t$. When this is done,
the Jacobian of the transformation is just $1/w_t$, and the Jacobian term in the
loglikelihood function is

$$-\sum_{t=1}^n \log w_t. \tag{10.96}$$

In order to compare a model that has $y_t$ as the regressand with another
model that has $y_t/w_t$ as the regressand, we need to add (10.96) to the value
of the loglikelihood reported for the second model. Doing this makes the
loglikelihoods from the two models comparable. If it really is appropriate to
use weighted least squares, then the maximized loglikelihood for the weighted
model should be higher than the maximized loglikelihood for the original model.
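
The same bookkeeping applies to weighted least squares. In the sketch below, the error standard deviation is assumed to be proportional to the single regressor, so that $w_t = x_t$; this choice, the DGP, and the helper names are illustrative assumptions. With $w_t = x_t$, dividing through by $w_t$ turns the regression of $y_t$ on a constant and $x_t$ into a regression of $y_t/w_t$ on $1/w_t$ and a constant.

```python
import math
import random

def ols_ssr(y, x):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def concentrated_loglik(ssr, n):
    return -0.5 * n * (math.log(2.0 * math.pi) + 1.0 + math.log(ssr / n))

rng = random.Random(1)
n = 300
x = [rng.uniform(1.0, 10.0) for _ in range(n)]
w = x                                   # assumed weights: error s.d. proportional to x
y = [2.0 + 1.5 * xi + 0.5 * xi * rng.gauss(0.0, 1.0) for xi in x]

# Unweighted model: y on a constant and x.
ll_unweighted = concentrated_loglik(ols_ssr(y, x), n)

# Weighted model: y/w on 1/w and x/w; here x/w = 1 plays the role of the constant.
y_w = [yi / wi for yi, wi in zip(y, w)]
inv_w = [1.0 / wi for wi in w]
ll_weighted_reported = concentrated_loglik(ols_ssr(y_w, inv_w), n)

# Add the Jacobian term (10.96) so that both values refer to y itself.
ll_weighted = ll_weighted_reported - sum(math.log(wi) for wi in w)

print(ll_unweighted, ll_weighted_reported, ll_weighted)
```

With heteroskedasticity of this form in the simulated data, the adjusted value for the weighted model should exceed the value for the unweighted one.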


The most common nonlinear transformation in econometrics is the logarithmic 
transformation. Very often, we may find ourselves estimating a number of 
models, some of which have $y_t$ as the regressand and some of which have
$\log y_t$ as the regressand. If we simply want to decide which model fits best,
we already know how to do so. We just have to compute the loglikelihood
function for each of the models, including the Jacobian term $-\sum_t \log y_t$ for
models in which the regressand is $\log y_t$, and pick the model with the highest
loglikelihood. But if we want to perform a formal statistical test, and perhaps 
reject one or more of the competing models as incompatible with the data, 
we must go beyond simply comparing loglikelihood values. 


The Box-Cox Regression Model 


Most procedures for testing linear and loglinear models make use of the Box- 
Cox transformation, 


$$B(x, \lambda) =
\begin{cases}
\dfrac{x^\lambda - 1}{\lambda} & \text{when } \lambda \neq 0; \\[6pt]
\log x & \text{when } \lambda = 0,
\end{cases}$$

where $\lambda$ is a parameter, which may be of either sign, and $x$, the argument of
the transformation, must be positive. By l'Hôpital's Rule, $\log x$ is the limit
of $(x^\lambda - 1)/\lambda$ as $\lambda \to 0$. Figure 10.3 shows the Box-Cox transformation for
various values of $\lambda$. In practice, $\lambda$ generally ranges from somewhat below 0 to
somewhat above 1. It can be shown that $B(x, \lambda') \ge B(x, \lambda'')$ for $\lambda' \ge \lambda''$, and
this inequality is evident in the figure. Thus the amount of curvature induced
by the Box-Cox transformation increases as $\lambda$ gets farther from 1 in either
direction.
[Figure 10.3: $B(x, \lambda)$ plotted against $x$ for $x$ between 0 and 4. Curves, from highest to lowest, correspond to $\lambda = 1.5$, 1, 0.5, 0, $-0.5$, and $-1$.]

Figure 10.3 Box-Cox transformations for various values of $\lambda$
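
A minimal implementation of the transformation may help to make the definition and the ordering shown in the figure concrete. The function name `box_cox` is an assumption of this sketch; it uses `math.expm1` so that values of $\lambda$ very close to zero do not lose precision.

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation B(x, lambda) for x > 0."""
    if x <= 0:
        raise ValueError("the argument of the Box-Cox transformation must be positive")
    if lam == 0:
        return math.log(x)
    # (x**lam - 1) / lam, written with expm1 for accuracy when lam is near 0
    return math.expm1(lam * math.log(x)) / lam

lambdas = (-1, -0.5, 0, 0.5, 1, 1.5)
for x in (0.5, 3.0):
    print(x, [round(box_cox(x, lam), 4) for lam in lambdas])

# The limit as lambda -> 0 is log x.
print(box_cox(2.0, 1e-8), math.log(2.0))
```

For each fixed $x$, the printed values increase with $\lambda$, which is the inequality stated above.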


For the purposes of this section, the important thing about the Box-Cox 
transformation is that it allows us to formulate models which include both 
linear and loglinear regression models as special cases. In particular, consider 
the Box-Cox regression model 


$$B(y_t, \lambda) = \sum_{i=1}^{k_1} \beta_i Z_{ti}
  + \sum_{i=k_1+1}^{k} \beta_i B(X_{ti}, \lambda) + u_t,
  \quad u_t \sim \mathrm{NID}(0, \sigma^2), \tag{10.97}$$

in which there are $k_1$ regressors $Z_{ti}$ that are not subject to transformation
and $k_2 = k - k_1$ nonconstant regressors $X_{ti}$ that are always positive and are
subject to transformation. The $Z_{ti}$ would include the constant term, if any,
in addition to dummy variables and any other regressors that can take on
nonpositive values. When $\lambda = 1$, this model reduces to the linear regression
model

$$y_t - 1 = \sum_{i=1}^{k_1} \beta_i Z_{ti}
  + \sum_{i=k_1+1}^{k} \beta_i (X_{ti} - 1) + u_t,
  \quad u_t \sim \mathrm{NID}(0, \sigma^2).$$

Provided there is a constant term, or the equivalent of a constant term, among
the $Z_{ti}$ regressors, this is equivalent to

$$y_t = \sum_{i=1}^{k_1} \beta_i Z_{ti}
  + \sum_{i=k_1+1}^{k} \beta_i X_{ti} + u_t,
  \quad u_t \sim \mathrm{NID}(0, \sigma^2), \tag{10.98}$$

with the $\beta_i$ corresponding to the constant term redefined in the obvious way.
When $\lambda = 0$, on the other hand, the Box-Cox model (10.97) reduces to the
loglinear regression model

$$\log y_t = \sum_{i=1}^{k_1} \beta_i Z_{ti}
  + \sum_{i=k_1+1}^{k} \beta_i \log X_{ti} + u_t,
  \quad u_t \sim \mathrm{NID}(0, \sigma^2). \tag{10.99}$$


Thus it is clear that the linear regression model (10.98) and the loglinear 
regression model (10.99) can both be obtained as special cases of the Box- 
Cox regression model (10.97). 


Testing Linear and Loglinear Regression Models 


There are many ways in which we can test (10.98) and (10.99) against (10.97). 
Conceptually, the simplest is just to estimate all three models and perform 
two likelihood ratio tests. Let $\hat\ell(\hat\lambda)$ denote the maximum of the loglikelihood
function for the unrestricted Box-Cox model (10.97), which readers are asked
to derive in Exercise 10.29. Similarly, let $\hat\ell(1)$ and $\hat\ell(0)$ denote the maxima of
the loglikelihood functions for the linear and loglinear models, respectively.
Then the statistics for testing the linear and loglinear models against the
Box-Cox regression model are

$$2\bigl(\hat\ell(\hat\lambda) - \hat\ell(1)\bigr) \quad\text{and}\quad 2\bigl(\hat\ell(\hat\lambda) - \hat\ell(0)\bigr),$$

respectively. If either of these statistics exceeds $\chi^2_{1-\alpha}(1)$, the $1 - \alpha$ quantile
of the $\chi^2(1)$ distribution, we may reject the model being tested at level $\alpha$.
The two most widely used critical values are $\chi^2_{0.95}(1) = 3.84$
and $\chi^2_{0.99}(1) = 6.63$. In practice, these tests often reject an inappropriate
model in samples of even moderate size.
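
The two LR statistics can be sketched for the simplest case, with a constant and one transformed regressor. Everything specific in this example is an assumption: the simulated loglinear DGP, the grid search over $\lambda$ (a crude stand-in for a proper numerical maximization), and the helper names. The concentrated loglikelihood used here includes the Jacobian term $(\lambda - 1)\sum_t \log y_t$, which follows from (10.92) because $dB(y, \lambda)/dy = y^{\lambda - 1}$.

```python
import math
import random

def box_cox(x, lam):
    return math.log(x) if lam == 0 else math.expm1(lam * math.log(x)) / lam

def ols_ssr(y, x):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def box_cox_loglik(y, x, lam):
    """Concentrated loglikelihood of B(y,lam) = b1 + b2*B(x,lam) + u, as a model for y."""
    n = len(y)
    ssr = ols_ssr([box_cox(v, lam) for v in y], [box_cox(v, lam) for v in x])
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * (math.log(2.0 * math.pi) + 1.0 + math.log(ssr / n)) + jacobian

rng = random.Random(2)
n = 200
x = [rng.uniform(1.0, 10.0) for _ in range(n)]
y = [math.exp(1.0 + 0.8 * math.log(xi) + rng.gauss(0.0, 0.3)) for xi in x]  # assumed DGP

grid = [i / 20 for i in range(-20, 41)]          # lambda from -1 to 2, including 0 and 1
ll = {lam: box_cox_loglik(y, x, lam) for lam in grid}
lam_hat = max(ll, key=ll.get)

lr_linear = 2.0 * (ll[lam_hat] - ll[1.0])        # tests lambda = 1
lr_loglinear = 2.0 * (ll[lam_hat] - ll[0.0])     # tests lambda = 0
print(lam_hat, lr_linear, lr_loglinear)
```

Each statistic would be compared with 3.84 for a test at the .05 level. Because the grid contains both 0 and 1, neither statistic can be negative.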


This procedure is conceptually very simple, but it requires us to estimate $\lambda$,
which is a bit more work than simply running a linear regression. In some
cases, however, we can avoid estimating $\lambda$. We know that $\hat\ell(\hat\lambda)$ cannot be
smaller than the larger of $\hat\ell(1)$ and $\hat\ell(0)$. Therefore, if

$$2\bigl(\hat\ell(0) - \hat\ell(1)\bigr) > \chi^2_{1-\alpha}(1), \tag{10.100}$$

we can certainly reject the linear model, even though we have not actually
estimated the Box-Cox model or computed the LR test statistic. Similarly, if

$$2\bigl(\hat\ell(1) - \hat\ell(0)\bigr) > \chi^2_{1-\alpha}(1), \tag{10.101}$$


we can certainly reject the loglinear model. The quantities (10.100) and 
(10.101) provide lower bounds for the actual LR statistics. In practice, these 
lower bounds can often allow us to rule out models that are clearly incompat- 
ible with the data. 
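
The screening rule based on (10.100) and (10.101) takes only a few lines. In this sketch the function name and the example loglikelihood values are hypothetical, and the loglinear value is assumed to include the Jacobian term already; the default critical value 3.84 is the .05 value for $\chi^2(1)$.

```python
def lr_lower_bounds(ll_linear, ll_loglinear, critical_value=3.84):
    """Apply (10.100) and (10.101): reject a model if the lower bound on its
    LR statistic against the Box-Cox model already exceeds the critical value."""
    bound_vs_linear = 2.0 * (ll_loglinear - ll_linear)       # (10.100)
    bound_vs_loglinear = 2.0 * (ll_linear - ll_loglinear)    # (10.101)
    return {
        "reject_linear": bound_vs_linear > critical_value,
        "reject_loglinear": bound_vs_loglinear > critical_value,
    }

# Hypothetical maximized loglikelihoods, both expressed as models for y_t.
print(lr_lower_bounds(ll_linear=-500.0, ll_loglinear=-490.0))
print(lr_lower_bounds(ll_linear=-490.0, ll_loglinear=-489.0))
```

At most one of the two flags can be true, and when neither is true the bounds are simply uninformative.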




The fact that one can sometimes put a lower bound on the LR test statistic 
without actually estimating the unrestricted model is often very convenient. 
It was noted by Sargan (1964) in the context of choosing between linear and 
loglinear models, is widely used by applied workers, and has been proposed as 
a general basis for model selection by Pollak and Wales (1991). The procedure 
works in only one direction, of course. If, for example, (10.100) allows us to 
reject the linear model, then it tells us nothing about whether the loglinear 
model is compatible with the data.


Lagrange Multiplier Tests 


Since it is very easy to estimate linear and loglinear regression models, but 
somewhat harder to estimate the Box-Cox regression model, it is natural to 
use LM tests in this context. The first tests of this type were proposed by 
Godfrey and Wickens (1981). They are based on the OPG regression (10.72). 
However, as is often the case with tests based on the OPG regression, these 
tests tend to overreject quite severely in finite samples. Therefore, David- 
son and MacKinnon (1985b) proposed Lagrange multiplier tests based on the 
double-length artificial regression, or DLR, that they had previously devel- 
oped in Davidson and MacKinnon (1984a). This artificial regression is called 
“double-length” because it has 2n “observations,” two for each of the actual 
observations in the sample. 


For reasons of space, we will not write down the OPG or DLR test regressions 
here. Readers are asked to derive a special case of the former in Exercise 10.29. 
The latter, which are somewhat more complicated, are discussed in detail in 
Davidson and MacKinnon (1993, Chapter 14). If an LM test is to be used, 
we recommend use of the DLR rather than the OPG variant. There is a 
good deal of evidence that the DLR variant is much more reliable in finite 
samples; see Davidson and MacKinnon (1984b) and Godfrey, McAleer, and 
McKenzie (1988), among others. Of course, either variant of the test may 
easily be bootstrapped, as discussed in Section 10.6, and the OPG variant 
should perform acceptably when that is done. Because it is never necessary 
to estimate the unrestricted model, bootstrapping either of the LM tests will 
be considerably less expensive than bootstrapping the LR test. 


10.9 Final Remarks 


Maximum likelihood estimation is widely used in many areas of econometrics, 
and we will encounter a number of important applications in the next four 
chapters. Readers seeking a more advanced treatment of the theory than we 
were able to give in this chapter may wish to consult Davidson and MacKinnon 
(1993), Cox and Hinkley (1974), or Stuart, Ord, and Arnold (1998). 


As we have seen, ML estimation has many good properties, although these 
may be more apparent asymptotically than in finite samples. Its biggest limit- 
ation is the need for a fully specified parametric model. However, even if the 
dependent variable does not follow its assumed distribution, quasi-maximum
likelihood estimators may still be consistent, although they will not be asymp- 
totically efficient. 


10.10 Exercises 


10.1 Show that the ML estimator of the parameters $\beta$ and $\sigma$ of the classical normal
linear model can be obtained by first concentrating the loglikelihood with
respect to $\beta$ and then maximizing the concentrated loglikelihood thereby
obtained with respect to $\sigma$.


10.2 Let the $n$-vector $y$ be a vector of mutually independent realizations from the
uniform distribution on the interval $[\beta_1, \beta_2]$, usually denoted by $U(\beta_1, \beta_2)$.
Thus, $y_t \sim U(\beta_1, \beta_2)$ for $t = 1, \ldots, n$. Let $\hat\beta_1$ be the ML estimator of $\beta_1$ given
in (10.13), and suppose that the true values of the parameters are $\beta_1 = 0$ and
$\beta_2 = 1$. Show that the CDF of $\hat\beta_1$ is

$$F(\beta) \equiv \Pr(\hat\beta_1 \le \beta) = 1 - (1 - \beta)^n.$$

Use this result to show that $n(\hat\beta_1 - \beta_{10})$, which in this case is just $n\hat\beta_1$, is
asymptotically exponentially distributed with $\theta = 1$. Note that the PDF of
the exponential distribution was given in (10.03). (Hint: The limit as $n \to \infty$
of $(1 + x/n)^n$, for arbitrary real $x$, is $e^x$.)

Show that, for arbitrary given $\beta_{10}$ and $\beta_{20}$, with $\beta_{20} > \beta_{10}$, the
asymptotic distribution of $n(\hat\beta_1 - \beta_{10})$ is characterized by the density (10.03) with
$\theta = (\beta_{20} - \beta_{10})^{-1}$.

10.3 Generate 10,000 random samples of sizes 20, 100, and 500 from the uniform
U (0,1) distribution. For each sample, compute 7, the sample mean, and 4, 
the average of the largest and smallest observations. Calculate the root mean 
squared error of each of these estimators for each of the three sample sizes. 
Do the results accord with what theory predicts? 


10.4 Suppose that $h(\cdot)$ is a strictly concave, twice continuously differentiable function
on a possibly infinite interval of the real line. Let $X$ be a random variable
of which the support is contained in that interval. Suppose further that the 
first two moments of X exist. Prove Jensen’s Inequality for the random vari- 
able X and the strictly concave function h by performing a Taylor expansion 
of h about E(X). 


10.5 Prove that the definition (10.31) of the information matrix is equivalent to
the definition 


= 
Hint: Use the result (10.30). 
10.6 By differentiating the identity (10.28) with respect to $\theta_j$, show that

$$E_\theta\bigl(G_{ti}(y^t, \theta)\,G_{tj}(y^t, \theta) + (H_t)_{ij}(y^t, \theta)\bigr) = 0, \tag{10.102}$$

where the $k \times k$ matrix $H_t(y^t, \theta)$ is the Hessian of the contribution $\ell_t(y^t, \theta)$
to the loglikelihood. Show that (10.102) also holds if the left-hand side is the
expectation conditional on | 
10.7 Use the result (10.102) of the preceding exercise to prove the asymptotic
information matrix equality (10.34).

10.8 Consider the linear regression model with exogenous explanatory variables,

$$y = X\beta + u,$$

where the only assumptions made regarding the error terms are that they
are uncorrelated and have mean zero and finite variances that are, in general,
different for each observation. The OLS estimator, which is consistent for this
model, is equal to the ML estimator of the model under the assumption of
homoskedastic normal error terms. The ML estimator is therefore a QMLE
for this model. Show that the $k \times k$ block of the sandwich covariance matrix
estimator (10.45) that corresponds to $\hat\beta$ is a version of the HCCME for the
linear regression model.

10.9 Write out explicitly the empirical Hessian estimator of the covariance matrix
of $\hat\beta$ and $\hat\sigma$ for the classical normal linear model. How is it related to the IM
estimator (10.53)?

How would your answer change if $X\beta$ in the classical normal linear model were
replaced by $x(\beta)$, a vector of nonlinear regression functions that implicitly
depend on exogenous variables?

10.10 Suppose you treat $\sigma^2$ instead of $\sigma$ as a parameter. Use arguments similar to
the ones that led to (10.53) to derive the information matrix estimator of the
covariance matrix of $\hat\beta$ and $\hat\sigma^2$. Then show that the same estimator can also
be obtained by using the delta method.

10.11 Explain how to compute two different 95% confidence intervals for $\sigma^2$. One
should be based on the covariance matrix estimator obtained in Exercise
10.10, and the other should be based on the original estimator (10.53). Will
both of the intervals be symmetric? Which seems more reasonable?

10.12 Let $\hat\theta$ denote any unbiased estimator of the $k$ parameters of a parametric
model fully specified by the loglikelihood function $\ell(\theta)$. The unbiasedness
property can be expressed as the following identity:

$$\int L(y, \theta)\,\hat\theta\,dy = \theta. \tag{10.103}$$

By using the relationship between $L(y, \theta)$ and $\ell(y, \theta)$ and differentiating this
identity with respect to the components of $\theta$, show that

$$\mathrm{Cov}_\theta\bigl(g(\theta), (\hat\theta - \theta)\bigr) = I,$$

where $I$ is a $k \times k$ identity matrix, and the notation $\mathrm{Cov}_\theta$ indicates that the
covariance is to be calculated under the DGP characterized by $\theta$.

Let $V$ denote the $2k \times 2k$ covariance matrix of the $2k$-vector obtained by
stacking the $k$ components of $g(\theta)$ above the $k$ components of $\hat\theta - \theta$. Partition
this matrix into 4 $k \times k$ blocks as follows:

$$V = \begin{bmatrix} V_1 & C \\ C^\top & V_2 \end{bmatrix},$$

where $V_1$ and $V_2$ are, respectively, the covariance matrices of the vectors $g(\theta)$
and $\hat\theta - \theta$ under the DGP characterized by $\theta$. Then use the fact that $V$ is
positive semidefinite to show that the difference between $V_2$ and $I^{-1}(\theta)$, where
$I(\theta)$ is the (finite-sample) information matrix for the model, is a positive
semidefinite matrix. Hint: Use the result of Exercise 7.11.


10.13 Consider the linear regression model

$$y = X_1\beta_1 + X_2\beta_2 + u, \quad u \sim N(0, \sigma^2 I). \tag{10.104}$$

Derive the Wald statistic for the hypothesis that $\beta_2 = 0$, as a function of the
data, from the general formula (10.60). Show that it would be numerically
identical to the Wald statistic (6.71) if the same estimate of $\sigma^2$ were used.

Show that, if the estimate of $\sigma^2$ is either the OLS or the ML estimator based on
the unrestricted model (10.104), the Wald statistic is a deterministic, strictly
increasing, function of the conventional $F$ statistic. Give the explicit form of
this deterministic function. Why can one reasonably expect that this result
holds for tests of arbitrary linear restrictions on the parameters, and not only
for zero restrictions of the type considered in this exercise?


10.14 The model specified by the loglikelihood function $\ell(\theta)$ is said to be
reparametrized if the parameter vector $\theta$ is replaced by another parameter vector $\phi$
related to $\theta$ by a one to one relationship $\theta = \Theta(\phi)$ with inverse $\phi = \Theta^{-1}(\theta)$.
The loglikelihood function for the reparametrized model is then defined as
$\ell'(\phi) \equiv \ell(\Theta(\phi))$. Explain why this definition makes sense.

Show that the maximum likelihood estimates $\hat\phi$ of the reparametrized model
are related to the estimates $\hat\theta$ of the original model by the relation $\hat\theta = \Theta(\hat\phi)$.
Specify the relationship between the gradients and information matrices of the
two models in terms of the derivatives of the components of $\phi$ with respect
to those of $\theta$.

Suppose that it is wished to test a set of $r$ restrictions written as $r(\theta) = 0$.
These restrictions can be applied to the reparametrized model in the form
$r'(\phi) \equiv r(\Theta(\phi)) = 0$. Show that the LR statistic is invariant to whether
the restrictions are tested for the original or the reparametrized model. Show
that the same is true for the LM statistic (10.69).


10.15 Show that the artificial OPG regression (10.73) possesses all the properties
needed for hypothesis testing in the context of a model estimated by maximum
likelihood. Specifically, show that

• the regressand $\iota$ is orthogonal to the regressors $G(\theta)$ when the latter are
evaluated at the MLE $\hat\theta$;

• the estimated OLS covariance matrix from (10.73) evaluated at $\hat\theta$, when
multiplied by $n$, consistently estimates the inverse of the asymptotic information
matrix;

• the OPG regression (10.73) allows one-step estimation: If the OLS
parameter estimates $\acute{c}$ from (10.73) are evaluated at $\theta = \acute\theta$, where $\acute\theta$ is any root-$n$
consistent estimator of $\theta$, then the one-step estimator $\grave\theta \equiv \acute\theta + \acute{c}$ is
asymptotically equivalent to $\hat\theta$, in the sense that $n^{1/2}(\grave\theta - \theta_0)$ and $n^{1/2}(\hat\theta - \theta_0)$ tend
to the same random variable as $n \to \infty$.
10.16 Show that the explained sum of squares from the artificial OPG regression
(10.73) is equal to $n$ times the uncentered $R^2$ from the same regression. Relate
this fact to the use of test statistics that take the form of $n$ times the $R^2$ of
a GNR (Section 6.7) or of an IVGNR (Section 8.6 and Exercise 8.21).

10.17 Express the LM statistic (10.74) as a deterministic, strictly increasing,
function of the $F$ statistic (10.57).

10.18 Consider a model characterized by a loglikelihood function $\ell(y, \theta)$, where $\theta$
is a scalar parameter. Suppose there is a particular data set $y$ such that the
loglikelihood of $y$ is a quadratic function of $\theta$:

$$\ell(\theta) = a\theta - \tfrac12 b\theta^2. \tag{10.105}$$

Compute the three classical test statistics for the hypothesis that $\theta = 0$. For
the Wald and LM tests, use the information matrix estimate of the variance
of $\hat\theta$. Show that the three test statistics are equal. Graph the loglikelihood
function (10.105) and interpret the constituent elements of the three statistics
geometrically.

10.19 Let the loglikelihood function $\ell(\theta)$ depend on one scalar parameter $\theta$. For
this special case, consider the distribution of the LM statistic (10.69) under
the drifting DGP characterized by the parameter $\theta = n^{-1/2}\delta$ for a fixed $\delta$.
This DGP drifts toward the fixed DGP with $\theta = 0$, which we think of as
representing the null hypothesis. Show first that $n^{-1} I(n^{-1/2}\delta) \to \mathcal{I}(0)$ as
$n \to \infty$. Here the asymptotic information matrix $\mathcal{I}(\theta)$ is just a scalar, since
there is only one parameter.

Next show that $n^{-1/2}$ times the gradient, evaluated at $\theta = 0$, which we
may write as $n^{-1/2} g(0)$, is asymptotically normally distributed with mean
$\delta\,\mathcal{I}(0)$ and variance $\mathcal{I}(0)$. Finally, show that the LM statistic is asymptotically
distributed as $\chi^2(1)$ with a finite noncentrality parameter, and give the value
of that noncentrality parameter.

10.20 Let $z \sim N(\mu, \sigma^2)$, and consider the lognormal random variable $x \equiv e^z$. Using
the result that

$$E(e^z) = \exp(\mu + \tfrac12\sigma^2), \tag{10.106}$$

compute the second, third, and fourth central moments of $x$. Show that $x$ is
skewed to the right and has positive excess kurtosis.

Note: The excess kurtosis of a random variable is formally defined as the ratio
of the fourth central moment to the square of the variance, minus 3.

10.21 The GNR proposed in Section 7.8 for NLS estimation of the model (10.84)
can be written schematically as

$$\begin{bmatrix} (1 - \rho^2)^{1/2} u_1(\beta) \\ u_t(\beta) - \rho u_{t-1}(\beta) \end{bmatrix}
= \begin{bmatrix} (1 - \rho^2)^{1/2} X_1 & 0 \\ X_t - \rho X_{t-1} & u_{t-1}(\beta) \end{bmatrix}
\begin{bmatrix} b \\ b_\rho \end{bmatrix} + \text{residuals},$$

where $u_t(\beta) \equiv y_t - X_t\beta$ for $t = 1, \ldots, n$, and the last $n - 1$ rows of the artificial
variables are indicated by their typical elements. Append one extra artificial
observation to this artificial regression. For this observation, the regressand
is $\bigl((1 - \rho^2)u_1^2(\beta)/\sigma_\varepsilon - \sigma_\varepsilon\bigr)/\sqrt2$, the regressor in the column corresponding to $\rho$
is $\rho\sigma_\varepsilon\sqrt2/(1 - \rho^2)$, and the regressors in the columns corresponding to the
elements of $\beta$ are all 0. Show that, if at each iteration $\sigma_\varepsilon^2$ is updated by the
formula

$$\sigma_\varepsilon^2 = \frac1n\Bigl((1 - \rho^2)u_1^2(\beta) + \sum_{t=2}^n \bigl(u_t(\beta) - \rho u_{t-1}(\beta)\bigr)^2\Bigr),$$

then, if the iterations defined by the augmented artificial regression converge,
the resulting parameter estimates satisfy the estimating equations (10.88)
that define the ML estimator.

The odd-looking factors of $\sqrt2$ in the extra observation are there for a reason:
Show that, when the artificial regression has converged, $\hat\sigma_\varepsilon^{-2}$ times the matrix
of cross-products of the regressors is equivalent to the block of the information
matrix that corresponds to $\beta$ and $\rho$ evaluated at the ML estimates. Explain
why this means that we can use the OLS covariance matrix from the artificial
regression to estimate the covariance matrix of $\hat\beta$ and $\hat\rho$.


10.22 Using the artificial data in the file ar1.data, estimate the linear regression
model

$$y_t = \beta_1 + \beta_2 x_t + u_t, \quad u_t = \rho u_{t-1} + \varepsilon_t, \quad t = 1, \ldots, 100,$$

which is correctly specified, in two different ways: ML omitting the first
observation, and ML using all 100 observations. The second method will
yield more efficient estimates of $\beta_1$ and $\beta_2$. For each of these two parameters,
how large a sample of observations similar to the last 99 observations would
be needed to obtain estimates as efficient as those obtained by using all 100
observations? Explain why your answer is greater than 100 in both cases.

10.23 Let the two random variables $X$ and $Z$ be related by the deterministic
equation $Z = h(X)$, where $h$ is strictly decreasing. Show that the PDFs of the
two variables satisfy the equation

$$f_X(x) = -f_Z\bigl(h(x)\bigr)h'(x).$$

Then show that (10.92) holds whenever $h$ is a strictly monotonic function.

Let $X = Z^2$. Express the density of $X$ in terms of that of $Z$, taking account of
the possibility that the support of $Z$ may include negative as well as positive
numbers.

10.24 Suppose that a dependent variable $y$ follows the exponential distribution given
in (10.03), and let $x = y^2$. What is the density of $x$? Find the ML estimator
of the parameter $\theta$ based on a sample of $n$ observations $x_t$ which are assumed
to follow the distribution of which you have just obtained the density.


10.25 For a sample of $n$ observations $y_t$ generated from the exponential distribution,
the loglikelihood function is (10.04), and the ML estimator is (10.06). Derive
the asymptotic information matrix $\mathcal{I}(\theta)$, which is actually a scalar in this case,
and use it to show how $n^{1/2}(\hat\theta - \theta_0)$ is distributed asymptotically. What is the
empirical Hessian estimator of the variance of $\hat\theta$? What is the IM estimator?

There is an alternative parametrization of the exponential distribution, in
which the parameter is $\phi \equiv 1/\theta$. Write down the loglikelihood function in
terms of $\phi$ and obtain the asymptotic distribution of $n^{1/2}(\hat\phi - \phi_0)$. What
is the empirical Hessian estimator of the variance of $\hat\phi$? What is the IM
estimator?

10.26 Consider the ML estimator $\hat\theta$ from the previous exercise. Explain how you
could obtain an asymptotic confidence interval for $\theta$ in three different ways.
The first should be based on inverting a Wald test in the $\theta$ parametrization,
the second should be based on inverting a Wald test in the $\phi$ parametrization,
and the third should be based on inverting an LR test.

Generate 100 observations from the exponential distribution with $\theta = 0.5$, find
the ML estimate based on these artificial data, and calculate 95% confidence
intervals for $\theta$ using the three methods just proposed. Hint: To generate
the data, use uniformly distributed random numbers and the inverse of the
exponential CDF.

10.27 Use the result (10.92) to derive the PDF of the $N(\mu, \sigma^2)$ distribution from
the PDF of the standard normal distribution.

In the classical normal linear model as specified in (10.07), it is the
distribution of the error terms $u$ that is specified rather than that of the dependent
variable $y$. Reconstruct the loglikelihood function (10.10) starting from the
densities of the error terms $u_t$ and using the Jacobians of the transformations
that express the $y_t$ in terms of the $u_t$.

10.28 Consider the model

$$y_t^{1/2} = X_t\beta + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2),$$

in which it is assumed that all observations $y_t$ on the dependent variable are
positive. Write down the loglikelihood function for this model.

10.29 Derive the loglikelihood function for the Box-Cox regression model (10.97).
Then consider the following special case:

$$B(y_t, \lambda) = \beta_1 + \beta_2 B(x_t, \lambda) + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2).$$

Derive the OPG regression for this model and explain precisely how to use it
to test the hypotheses that the DGP is linear ($\lambda = 1$) and loglinear ($\lambda = 0$).

10.30 Consider the model (9.122) of the Canadian consumption function, with data
from the file consumption.data, for the period 1953:1 to 1996:4. Compute the
value of the maximized loglikelihood for this model regarded as a model for
the level (not the log) of current consumption.


Formulate a model with the same algebraic form as (9.122), but in levels of 
the income and consumption variables. Compute the maximized loglikelihood 
of this second model, and compare it with the value you obtained for the 
model in logs. Can you draw any conclusion about whether either model is 
misspecified? 


Formulate a third model, using the variables in levels, but dividing them all 
by current income $Y_t$ in order to account for heteroskedasticity. The result
will be a weighted least squares model. Compute the maximized loglikelihood 
for this model as a model for the level of current consumption. Are there any 
more conclusions you can draw on the basis of your results? 

Formulate a Box-Cox regression model which includes the first and second
models of the previous exercise as special cases. Use the OPG regression to
perform an LM test of the hypothesis that the Box-Cox parameter $\lambda = 0$, that
is, that the loglinear model is correctly specified. Obtain both asymptotic and
bootstrap P values.


The model (9.122) that was estimated in Exercise 10.30 can be written as 
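$$\Delta c_t = \beta_1 + \beta_2 \Delta y_t + \beta_3 \Delta y_{t-1} + \varepsilon_t,$$


where $\varepsilon_t \sim \mathrm{NID}(0, 1)$. Suppose now that the $\varepsilon_t$, instead of being standard
normal, follow the Cauchy distribution, with density $f(\varepsilon_t) = \bigl(\pi(1 + \varepsilon_t^2)\bigr)^{-1}$.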
Estimate the resulting model by maximum likelihood, and compare the max- 
imized value of the loglikelihood with the one obtained in Exercise 9.12. 


Suppose that the dependent variable $y_t$ is a proportion, so that $0 < y_t < 1$,
$t = 1, \ldots, n$. An appropriate model for such a dependent variable is


$$\log\left(\frac{y_t}{1 - y_t}\right) = X_t\beta + u_t,$$


where $X_t$ is a $k \times 1$ vector of exogenous variables, and $\beta$ is a $k$-vector. Write
down the loglikelihood function for this model under the assumption that
$u_t \sim \mathrm{NID}(0, \sigma^2)$.

11.1 Introduction


Although regression models are useful for modeling many types of data, they 
are not suitable for modeling every type. In particular, they should not be 
used when the dependent variable is discrete and can therefore take on only 
a countable number of values, or when it is continuous but is limited in the 
range of values it can take on. Since variables of these two types arise quite 
often, it is important to be able to deal with them, and a large number of 
models has been proposed for doing so. In this chapter, we discuss some of the 
simplest and most commonly used models for discrete and limited dependent 
variables. 


The most commonly encountered type of dependent variable that cannot be 
handled properly using a regression model is a binary dependent variable. 
Such a variable can take on only two values, which for practical reasons are 
almost always coded as 0 and 1. For example, a person may be in or out 
of the labor force, a commuter may drive to work or take public transit, a 
household may own or rent the home it resides in, and so on. In each case, 
the economic agent chooses between two alternatives, one of which is coded 
as 0 and one of which is coded as 1. A binary response model then tries to 
explain the probability that the agent will choose alternative 1 as a function 
of some observed explanatory variables. We discuss binary response models 
at some length in Sections 11.2 and 11.3.


A binary dependent variable is a special case of a discrete dependent variable. 
In Section 11.4, we briefly discuss several models for dealing with discrete 
dependent variables that can take on a fixed number of values. We consider 
two different cases, one in which the values have a natural ordering, and one 
in which they do not. Then, in Section 11.5, we discuss models for count data, 
in which the dependent variable can, in principle, take on any nonnegative, 
integer value. 


Sometimes, a dependent variable is continuous but can take on only a limited 
range of values. For example, most types of consumer spending can be zero 
or positive but cannot be negative. If we have a sample that includes some 
zero observations, we need to use a model that explicitly allows for this. By
the same token, if the zero observations are excluded from the sample, we 
need to take account of this omission. Both types of model are discussed 
in Section 11.6. The related problem of sample selectivity, in which certain 
observations are omitted from the sample in a nonrandom way, is dealt with 
in Section 11.7. Finally, in Section 11.8, we discuss duration models, which 
attempt to explain how much time elapses before some event occurs or some 
state changes. 


11.2 Binary Response Models: Estimation 


In a binary response model, the dependent variable $y_t$ can take on only
two values, 0 and 1. Let $P_t$ denote the probability that $y_t = 1$ conditional
on the information set $\Omega_t$, which consists of exogenous and predetermined
variables. A binary response model serves to model this conditional probability.
Since the values are 0 or 1, it is clear that $P_t$ is also the expectation of $y_t$
conditional on $\Omega_t$:


$$P_t \equiv \Pr(y_t = 1 \mid \Omega_t) = E(y_t \mid \Omega_t).$$


Thus a binary response model can also be thought of as modeling a conditional 
expectation. 


For many types of dependent variable, we can use a regression model to model
conditional expectations, but that is not a sensible thing to do in this case.
Suppose that $X_t$ denotes a row vector of length $k$ of variables that belong
to the information set $\Omega_t$, almost always including a constant term or the
equivalent. Then a linear regression model would specify $E(y_t \mid \Omega_t)$ as $X_t\beta$.
But such a model fails to impose the condition that $0 \le E(y_t \mid \Omega_t) \le 1$, which
must hold because $E(y_t \mid \Omega_t)$ is a probability. Even if this condition happened
to hold for all observations in a particular sample, it would always be easy
to find values of $X_t$ for which the estimated probability $X_t\hat\beta$ would be less
than 0 or greater than 1.


Since it makes no sense to have estimated probabilities that are negative or 
greater than 1, simply regressing $y_t$ on $X_t$ is not an acceptable way to model
the conditional expectation of a binary variable. However, as we will see in 
the next section, such a regression can provide some useful information, and 
it is therefore not a completely useless thing to do in the early stages of an 
empirical investigation. 


Any reasonable binary response model must ensure that $E(y_t \mid \Omega_t)$ lies in the
0-1 interval. In principle, there are many ways to do this. In practice, however,
two very similar models are widely used. Both of these models ensure that
$0 < P_t < 1$ by specifying that


$$P_t \equiv E(y_t \mid \Omega_t) = F(X_t\beta). \qquad (11.01)$$

Here $X_t\beta$ is an index function, which maps from the vector $X_t$ of explanatory
variables and the vector $\beta$ of parameters to a scalar index, and $F(x)$ is a
transformation function, which has the properties that


$$F(-\infty) = 0, \quad F(\infty) = 1, \quad\text{and}\quad f(x) \equiv \frac{dF(x)}{dx} > 0. \qquad (11.02)$$


These properties are, in fact, just the defining properties of the CDF of a
probability distribution; recall Section 1.2. They ensure that, although the
index function $X_t\beta$ can take any value on the real line, the value of $F(X_t\beta)$
must lie between 0 and 1.

The properties (11.02) also ensure that $F(x)$ is a nonlinear function.
Consequently, changes in the values of the $X_{ti}$, which are the elements of $X_t$,
necessarily affect $E(y_t \mid \Omega_t)$ in a nonlinear fashion. Specifically, when $P_t$ is
given by (11.01), its derivative with respect to $X_{ti}$ is


$$\frac{\partial P_t}{\partial X_{ti}} = \frac{\partial F(X_t\beta)}{\partial X_{ti}} = f(X_t\beta)\beta_i, \qquad (11.03)$$


where $\beta_i$ is the $i$th element of $\beta$. Therefore, the magnitude of the derivative
is proportional to $f(X_t\beta)$. For the transformation functions that are almost
always employed, $f(X_t\beta)$ achieves a maximum at $X_t\beta = 0$ and then falls as
$|X_t\beta|$ increases; for examples, see Figure 11.1 below. Thus (11.03) tells us
that the effect on $P_t$ of a change in one of the independent variables is greatest
when $P_t = 0.5$ and very small when $P_t$ is close to 0 or 1.
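As a rough numerical illustration of (11.03), the sketch below evaluates the derivative for the two transformation functions taken up in the following subsections, the standard normal CDF and the logistic function. The coefficient value 0.8 and the grid of index values are assumptions chosen only for illustration.

```python
import math

def norm_pdf(x):
    """Standard normal density: f(x) for the probit model."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def logistic_pdf(x):
    """Derivative of the logistic function: f(x) for the logit model."""
    ex = math.exp(-abs(x))  # the density is symmetric, so -|x| avoids overflow
    return ex / (1.0 + ex) ** 2

def marginal_effect(f, index, beta_i):
    """Derivative of P_t with respect to X_ti, as in (11.03)."""
    return f(index) * beta_i

beta_i = 0.8  # hypothetical coefficient on X_ti
for index in (-3.0, -1.0, 0.0, 1.0, 3.0):
    probit_me = marginal_effect(norm_pdf, index, beta_i)
    logit_me = marginal_effect(logistic_pdf, index, beta_i)
    print(f"index {index:5.1f}: probit {probit_me:.4f}, logit {logit_me:.4f}")
```

In both cases the printed effect is largest at an index of zero, where $P_t = 0.5$, and shrinks as the index moves into either tail.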


The Probit Model 


The first of the two widely used choices for $F(x)$ is the cumulative standard
normal distribution function,


$$\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\bigl(-\tfrac{1}{2}X^2\bigr)\, dX.$$


When $F(X_t\beta) = \Phi(X_t\beta)$, (11.01) is called the probit model. Although there
exists no closed-form expression for $\Phi(x)$, it is easily evaluated numerically,
and its first derivative is, of course, simply the standard normal density
function, $\phi(x)$, which was defined in expression (1.06).


One reason for the popularity of the probit model is that it can be derived
from a model involving an unobserved, or latent, variable $y_t^\circ$. Suppose that


$$y_t^\circ = X_t\beta + u_t, \quad u_t \sim \mathrm{NID}(0, 1). \qquad (11.04)$$


We observe only the sign of $y_t^\circ$, which determines the value of the observed
binary variable $y_t$ according to the relationship


$$y_t = 1 \ \text{if}\ y_t^\circ > 0; \quad y_t = 0 \ \text{if}\ y_t^\circ \le 0. \qquad (11.05)$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 


446 Discrete and Limited Dependent Variables 


Together, (11.04) and (11.05) define what is called a latent variable model.
One way to think of $y_t^\circ$ is as an index of the net utility associated with some
action. If the action yields positive net utility, it will be undertaken; otherwise,
it will not be undertaken. Because we observe only the sign of $y_t^\circ$, we can
normalize the variance of $u_t$ to be unity. If the variance of $u_t$ were some other
value, say $\sigma^2$, we could divide $\beta$, $y_t^\circ$, and $u_t$ by $\sigma$. Then $u_t/\sigma$ would have
variance 1, but the value of $y_t$ would be unchanged. Another way to express
this property is to say that the variance of $u_t$ is not identified by the binary
response model.


We can now compute $P_t$, the probability that $y_t = 1$. It is


$$\begin{aligned}
\Pr(y_t = 1) &= \Pr(y_t^\circ > 0) = \Pr(X_t\beta + u_t > 0) \\
&= \Pr(u_t > -X_t\beta) = \Pr(u_t \le X_t\beta) = \Phi(X_t\beta).
\end{aligned} \qquad (11.06)$$


The second-last equality in (11.06) makes use of the fact that the standard
normal density function is symmetric around zero. The final result is just
what we would get by letting $\Phi(X_t\beta)$ play the role of the transformation
function $F(X_t\beta)$ in (11.01). Thus we have derived the probit model from the
latent variable model that consists of (11.04) and (11.05).
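The derivation can be checked by simulation. The sketch below draws the latent variable in (11.04) for one fixed value of the index, applies rule (11.05), and compares the share of ones with the standard normal CDF. It also rescales the latent model by a factor `sigma` to show that the observed outcome does not change. The index value, `sigma`, the seed, and the number of draws are all illustrative assumptions.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(12345)
index = 0.5        # hypothetical value of X_t beta
sigma = 2.0        # hypothetical rescaling of the latent model
n_draws = 100_000

ones = 0
same = True
for _ in range(n_draws):
    u = rng.gauss(0.0, 1.0)
    y = 1 if index + u > 0 else 0                           # rule (11.05)
    y_scaled = 1 if sigma * index + sigma * u > 0 else 0    # latent model times sigma
    same = same and (y == y_scaled)
    ones += y

share = ones / n_draws
print(share, norm_cdf(index))   # simulated and theoretical Pr(y_t = 1)
print(same)                     # rescaling leaves every y_t unchanged
```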


The Logit Model 


The logit model is very similar to the probit model. The only difference is
that the function $F(x)$ is now the logistic function


$$\Lambda(x) \equiv \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}, \qquad (11.07)$$


which has first derivative


$$\lambda(x) \equiv \frac{e^x}{(1 + e^x)^2} = \Lambda(x)\Lambda(-x). \qquad (11.08)$$


This first derivative is evidently symmetric around zero, which implies that
$\Lambda(-x) = 1 - \Lambda(x)$. A graph of the logistic function, as well as of the standard
normal distribution function, is shown in Figure 11.1 below.


The logit model is most easily derived by assuming that


$$\log\left(\frac{P_t}{1 - P_t}\right) = X_t\beta,$$


which says that the logarithm of the odds (that is, the ratio of the two
probabilities) is equal to $X_t\beta$. Solving for $P_t$, we find that


$$P_t = \frac{\exp(X_t\beta)}{1 + \exp(X_t\beta)} = \frac{1}{1 + \exp(-X_t\beta)} = \Lambda(X_t\beta).$$


This result is what we would get by letting $\Lambda(X_t\beta)$ play the role of the
transformation function $F(X_t\beta)$ in (11.01).
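The step from the log of the odds to the logistic function can be verified numerically. This minimal sketch checks that the log-odds transformation inverts the logistic function and that the logistic function satisfies the symmetry property noted above. The function names and the index values are illustrative choices.

```python
import math

def logistic(x):
    """Logistic function 1/(1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def log_odds(p):
    """Logarithm of the odds p/(1 - p)."""
    return math.log(p / (1.0 - p))

for index in (-2.0, 0.0, 0.7, 3.0):
    p = logistic(index)
    # log_odds(p) recovers the index; logistic(-index) + p equals 1
    print(index, p, log_odds(p), logistic(-index) + p)
```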


Maximum Likelihood Estimation of Binary Response Models


By far the most common way to estimate binary response models is to use the 
method of maximum likelihood. Because the dependent variable is discrete, 
the likelihood function cannot be defined as a joint density function, as it 
was in Chapter 10 for models with a continuously distributed dependent vari- 
able. When the dependent variable can take on discrete values, the likelihood 
function for those values should be defined as the probability that the value 
is realized, rather than as the probability density at that value. With this 
redefinition, the sum of the possible values of the likelihood is equal to 1, just 
as the integral of the possible values of a likelihood based on a continuous 
distribution is equal to 1. 


If, for observation $t$, the realized value of the dependent variable is $y_t$, then the
likelihood for that observation if $y_t = 1$ is just the probability that $y_t = 1$, and
if $y_t = 0$, it is the probability that $y_t = 0$. The logarithm of the appropriate
probability is then the contribution to the loglikelihood made by observation $t$.


Since the probability that $y_t = 1$ is $F(X_t\beta)$, the contribution to the
loglikelihood function for observation $t$ when $y_t = 1$ is $\log F(X_t\beta)$. Similarly, the
contribution to the loglikelihood function for observation $t$ when $y_t = 0$ is
$\log\bigl(1 - F(X_t\beta)\bigr)$. Therefore, if $y$ is an $n$-vector with typical element $y_t$, the
loglikelihood function for $y$ can be written as


$$\ell(y, \beta) = \sum_{t=1}^{n} \Bigl( y_t \log F(X_t\beta) + (1 - y_t) \log\bigl(1 - F(X_t\beta)\bigr) \Bigr). \qquad (11.09)$$


For each observation, one of the terms inside the large parentheses is always 0,
and the other is always negative. The first term is 0 whenever $y_t = 0$, and
the second term is 0 whenever $y_t = 1$. When either term is nonzero, it must
be negative, because it is equal to the logarithm of a probability, and this
probability must be less than 1 whenever $X_t\beta$ is finite. For the model to fit
perfectly, $F(X_t\beta)$ would have to equal 1 when $y_t = 1$ and 0 when $y_t = 0$, and
the entire expression inside the parentheses would then equal 0. This could
happen only if $X_t\beta = \infty$ whenever $y_t = 1$, and $X_t\beta = -\infty$ whenever $y_t = 0$.
Therefore, we see that (11.09) is bounded above by 0.
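A direct transcription of (11.09) may make the structure of the loglikelihood clearer. The sketch below accepts any transformation function `F` and evaluates the sum observation by observation. The five-observation data set and the parameter values are made up for illustration.

```python
import math

def logistic(x):
    """Logistic function, the F(x) of the logit model."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def norm_cdf(x):
    """Standard normal CDF, the F(x) of the probit model."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def loglik(beta, X, y, F):
    """Loglikelihood (11.09) for a given transformation function F."""
    total = 0.0
    for x_row, y_t in zip(X, y):
        index = sum(b * x for b, x in zip(beta, x_row))
        p = F(index)
        # Only one of the two terms in (11.09) is nonzero for each observation.
        total += math.log(p) if y_t == 1 else math.log(1.0 - p)
    return total

# Made-up data: a constant and one regressor.
X = [[1.0, -1.2], [1.0, -0.3], [1.0, 0.1], [1.0, 0.8], [1.0, 1.5]]
y = [0, 1, 0, 1, 1]

print(loglik([0.0, 0.0], X, y, logistic))   # every P_t is 0.5, so this is 5*log(0.5)
print(loglik([0.2, 1.0], X, y, norm_cdf))   # a probit evaluation; always negative
```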


Maximizing the loglikelihood function (11.09) is quite easy to do. For the logit
and probit models, this function is globally concave with respect to $\beta$ (see
Pratt, 1981, and Exercise 11.1). This implies that the first-order conditions,
or likelihood equations, uniquely define the ML estimator $\hat\beta$, except for one
special case we consider in the next subsection but one. These likelihood
equations can be written as


$$\sum_{t=1}^{n} \frac{\bigl(y_t - F(X_t\beta)\bigr) f(X_t\beta) X_{ti}}{F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr)} = 0, \quad i = 1, \ldots, k. \qquad (11.10)$$


There are many ways to find $\hat\beta$ in practice. Because of the global concavity
of the loglikelihood function, Newton's Method generally works very well.
Another approach, based on an artificial regression, will be discussed in the
next section.
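The following is a minimal sketch of Newton's Method for the logit model, restricted to a constant and a single regressor so that the 2 x 2 system can be solved by hand. For the logit model, $f = F(1 - F)$, so the likelihood equations (11.10) reduce to $\sum_t (y_t - \Lambda(X_t\beta))X_{ti} = 0$. The data, the tolerance, and the function name `fit_logit` are illustrative assumptions, not part of the text.

```python
import math

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def fit_logit(x, y, tol=1e-10, max_iter=50):
    """Newton's Method for a logit model with a constant and one regressor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x_t, y_t in zip(x, y):
            p = logistic(b0 + b1 * x_t)
            w = p * (1.0 - p)
            g0 += y_t - p               # gradient with respect to the constant
            g1 += (y_t - p) * x_t       # gradient with respect to the slope
            h00 += w                    # elements of minus the Hessian
            h01 += w * x_t
            h11 += w * x_t * x_t
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det    # Newton step: solve (-H) d = g
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up data with no perfect classifier.
x = [-1.2, -0.3, 0.1, 0.8, 1.5, -0.7, 0.4, 2.0]
y = [0, 1, 0, 1, 1, 0, 0, 1]

b0, b1 = fit_logit(x, y)
score0 = sum(y_t - logistic(b0 + b1 * x_t) for x_t, y_t in zip(x, y))
score1 = sum((y_t - logistic(b0 + b1 * x_t)) * x_t for x_t, y_t in zip(x, y))
print(b0, b1)
print(score0, score1)   # both likelihood equations should be close to zero
```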


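Conditions (11.10) look just like the first-order conditions for weighted least
squares estimation of the nonlinear regression model


$$y_t = F(X_t\beta) + v_t, \qquad (11.11)$$


where the weight for observation $t$ is


$$\Bigl(F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr)\Bigr)^{-1/2}. \qquad (11.12)$$


This weight is one over the square root of the variance of $v_t \equiv y_t - F(X_t\beta)$,
which is a binary random variable. By construction, $v_t$ has mean 0, and its
variance is


$$\begin{aligned}
E(v_t^2) &= E\bigl(y_t - F(X_t\beta)\bigr)^2 \\
&= F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr)^2 + \bigl(1 - F(X_t\beta)\bigr)\bigl(F(X_t\beta)\bigr)^2 \\
&= F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr).
\end{aligned} \qquad (11.13)$$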


Notice how easy it is to take expectations in the case of a binary random 
variable. There are just two possible outcomes, and the probability of each of 
them is specified by the model. 


Because the variance of $v_t$ in regression (11.11) is not constant, applying
nonlinear least squares to that regression would yield an inefficient estimator
of the parameter vector $\beta$. ML estimates could be obtained by applying
iteratively reweighted nonlinear least squares. However, Newton's method, or
a method based on the artificial regression to be discussed in the next section,
is more direct and usually much faster.


Since the ML estimator is equivalent to weighted NLS, we can obtain it as
an efficient GMM estimator. It is quite easy to construct elementary zero
functions for a binary response model. The obvious function for observation $t$
is $y_t - F(X_t\beta)$. The covariance matrix of the $n$-vector of these zero functions
is the diagonal matrix with typical element (11.13), and the row vector of
derivatives of the zero function for observation $t$ is $-f(X_t\beta)X_t$. With this
information, we can set up the efficient estimating equations (9.82). As readers
are asked to show in Exercise 11.3, these equations are equivalent to the
likelihood equations (11.10).


Intuitively, efficient GMM and maximum likelihood give the same estimator
because, once it is understood that the $y_t$ are binary variables, the elementary
zero functions serve to specify the probabilities $\Pr(y_t = 1)$, and they thus
constitute a full specification of the model.

Figure 11.1 Alternative choices for F(x)


Comparing Probit and Logit Models 


In practice, the probit and logit models generally yield very similar predicted 
probabilities, and the maximized values of the loglikelihood function (11.09) 
for the two models therefore tend to be very close. A formal comparison of 
these two values is possible. If twice the difference between them is greater 
than 3.84, the .05 critical value for the $\chi^2(1)$ distribution, then we can reject
whichever model fits less well at the .05 level. Such a procedure was discussed 
in Section 10.8 in the context of linear and loglinear models. In practice, 
however, experience shows that this sort of comparison rarely rejects either 
model unless the sample size is quite large. 


In most cases, the only real difference between the probit and logit models is
the way in which the elements of $\beta$ are scaled. This difference in scaling occurs
because the variance of the distribution for which the logistic function is the
CDF can be shown to be $\pi^2/3$, while that of the standard normal distribution
is, of course, unity. The logit estimates therefore all tend to be larger in
absolute value than the probit estimates, although usually by a factor that
is somewhat less than $\pi/\sqrt{3}$. Figure 11.1 plots the standard normal CDF,
the logistic function, and the logistic function rescaled to have variance unity.
The resemblance between the standard normal CDF and the rescaled logistic
function is striking. The main difference is that the rescaled logistic function
puts more weight in the extreme tails.


1 This assumes that there exists a comprehensive model, with a single additional
parameter, which includes the probit and logit models as special cases. It is
not difficult to formulate such a model; see Exercise 11.4.
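The comparison described above can be reproduced numerically. The sketch below tabulates the standard normal CDF, the logistic function, and the logistic function rescaled by $\pi/\sqrt{3}$ so that the underlying distribution has variance one. The grid of evaluation points is an arbitrary choice.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

scale = math.pi / math.sqrt(3.0)   # standard deviation of the logistic distribution

def rescaled_logistic(x):
    """Logistic CDF rescaled so that the distribution has variance one."""
    return logistic(scale * x)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(norm_cdf(x), 4), round(logistic(x), 4),
          round(rescaled_logistic(x), 4))

# Largest gap between the normal CDF and the rescaled logistic on a fine grid.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(norm_cdf(x) - rescaled_logistic(x)) for x in grid)
print(max_gap)
```

At $x = -3$ the rescaled logistic function lies above the normal CDF, which is the extra tail weight mentioned in the text, while the gap between the two curves stays small over the whole grid.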


The Perfect Classifier Problem 


We have seen that the loglikelihood function (11.09) is bounded above by 0,
and that it achieves this bound if $X_t\beta = -\infty$ whenever $y_t = 0$ and $X_t\beta = \infty$
whenever $y_t = 1$. Suppose there is some linear combination of the independent
variables, say $X_t\beta^\circ$, such that


$$\begin{aligned}
y_t &= 0 \ \text{whenever}\ X_t\beta^\circ < 0, \ \text{and} \\
y_t &= 1 \ \text{whenever}\ X_t\beta^\circ > 0.
\end{aligned} \qquad (11.14)$$


When this happens, there is said to be complete separation of the data. In
this case, it is possible to make the value of $\ell(y, \beta)$ arbitrarily close to 0 by
setting $\beta = \gamma\beta^\circ$ and letting $\gamma \to \infty$. This is precisely what any nonlinear
maximization algorithm will attempt to do if there exists a vector $\beta^\circ$ for
which conditions (11.14) are satisfied. Because of the limitations of computer
arithmetic, the algorithm will eventually terminate with some sort of numerical
error at a value of the loglikelihood function that is slightly less than 0. If
conditions (11.14) are satisfied, $X_t\beta^\circ$ is said to be a perfect classifier, since
it allows us to predict $y_t$ with perfect accuracy for every observation.
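To see complete separation at work, the sketch below evaluates the logit loglikelihood at multiples of a separating vector for a small made-up data set in which the outcome is 1 exactly when the regressor exceeds 0.5. The data, the separating vector `beta_sep`, and the multipliers are assumptions for illustration. The logarithm of the logistic function is computed in a numerically stable way so that very large indices do not cause errors.

```python
import math

def log_logistic(x):
    """Logarithm of the logistic function, stable for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def loglik(beta, X, y):
    """Logit loglikelihood (11.09)."""
    total = 0.0
    for x_row, y_t in zip(X, y):
        index = sum(b * x for b, x in zip(beta, x_row))
        # log(1 - Lambda(index)) equals log Lambda(-index)
        total += log_logistic(index) if y_t == 1 else log_logistic(-index)
    return total

# Made-up data: y_t = 1 exactly when the regressor exceeds 0.5.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 0.4], [1.0, 0.6], [1.0, 1.3], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
beta_sep = [-0.5, 1.0]   # the index is negative exactly when y_t = 0

multipliers = (1.0, 10.0, 100.0, 1000.0)
values = [loglik([g * b for b in beta_sep], X, y) for g in multipliers]
for g, v in zip(multipliers, values):
    print(g, v)   # the loglikelihood climbs toward 0 as the multiplier grows
```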


The problem of perfect classifiers has a geometrical interpretation. In the
$k$-dimensional space spanned by the columns of the matrix $X$ formed from
the row vectors $X_t$, the vector $\beta^\circ$ defines a hyperplane that passes through
the origin and that separates the observations for which $y_t = 1$ from those for
which $y_t = 0$. Whenever one column of $X$ is a constant, then the separating
hyperplane can be represented in the $(k - 1)$-dimensional space of the other
explanatory variables. If we write


$$X_t\beta^\circ = \alpha^\circ + X_{t2}\beta_2^\circ,$$


with $X_{t2}$ a $1 \times (k - 1)$ vector, then $X_t\beta^\circ = 0$ is equivalent to $X_{t2}\beta_2^\circ = -\alpha^\circ$,
which is the equation of a hyperplane in the space of the $X_{t2}$ that in general
does not pass through the origin. This is illustrated in Figure 11.2 for the
case $k = 3$. The asterisks, which all lie to the northeast of the separating
line for which $X_t\beta^\circ = 0$, represent the $X_{t2}$ for the observations with $y_t = 1$,
and the circles to the southwest of the separating line represent them for the
observations with $y_t = 0$.


It is clear from Figure 11.2 that, when a perfect classifier occurs, the
separating hyperplane is not, in general, unique. One could move the intercept of
the separating line in the figure up or down a little while maintaining the
separating property. Likewise, one could swivel the line a little about the point
of intersection with the vertical axis. Even if the separating hyperplane were
unique, we could not identify all the components of $\beta$. This follows from the
fact that the equation $X_t\beta^\circ = 0$ is equivalent to the equation $X_t(c\beta^\circ) = 0$ for
any nonzero scalar $c$. The separating hyperplane is therefore defined equally
well by any multiple of $\beta^\circ$. Although this suggests that we might be able to
estimate $\beta^\circ$ up to a scalar factor by imposing a normalization on it, there
is no question of estimating $\beta^\circ$ in the usual sense, and inference on it would
require methods beyond the scope of this book.


Figure 11.2 A perfect classifier yields a separating hyperplane


Even when no parameter vector exists that satisfies the inequalities (11.14),
there may exist a $\beta^\circ$ that satisfies the corresponding nonstrict inequalities.
There must then be at least one observation with $y_t = 0$ and $X_t\beta^\circ = 0$, and
at least one other observation with $y_t = 1$ and $X_t\beta^\circ = 0$. In such a case, we
speak of quasi-complete separation of the data. The separating hyperplane is
then unique, and the upper bound of the loglikelihood is no longer zero, as
readers are invited to verify in Exercise 11.6.


When there is either complete or quasi-complete separation, no finite ML
estimator exists. This is likely to occur in practice when the sample is very
small, when almost all of the $y_t$ are equal to 0 or almost all of them are equal
to 1, or when the model fits extremely well. Exercise 11.5 is designed to give
readers a feel for the circumstances in which ML estimation is likely to fail
because there is a perfect classifier.


If a perfect classifier exists, the loglikelihood should be close to its upper 
bound (which may be 0 or a small negative number) when the maximization 
algorithm quits. Thus, if the model seems to fit extremely well, or if the algo- 
rithm terminates in an unusual way, one should always check to see whether 
the parameter values imply the existence of a perfect classifier. For a detailed 
discussion of the perfect classifier problem, see Albert and Anderson (1984). 

11.3 Binary Response Models: Inference


Inference about the parameters of binary response models is usually based on
the standard results for ML estimation that were discussed in Chapter 10. It
can be shown that


$$\mathrm{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\beta - \beta_0)\Bigr) = \operatorname*{plim}_{n\to\infty} \Bigl(\frac{1}{n} X^\top \Upsilon(\beta_0) X\Bigr)^{-1}, \qquad (11.15)$$


where $X$ is an $n \times k$ matrix with typical row $X_t$, $\beta_0$ is the true value of $\beta$,
and $\Upsilon(\beta)$ is an $n \times n$ diagonal matrix with typical diagonal element


$$\Upsilon_t(\beta) \equiv \frac{f^2(X_t\beta)}{F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr)}. \qquad (11.16)$$


Not surprisingly, the right-hand side of expression (11.15) looks like the
asymptotic covariance matrix for weighted least squares estimation, with
weights (11.12), of the GNR that corresponds to regression (11.11). This
GNR is


$$y_t - F(X_t\beta) = f(X_t\beta) X_t b + \text{residual}. \qquad (11.17)$$


The factor of $f(X_t\beta)$ that multiplies all the regressors of the GNR accounts
for the numerator of (11.16). Its denominator is simply the variance of the
error term in regression (11.11). Two ways to obtain the asymptotic covariance
matrix (11.15) using general results for ML estimation are explored in
Exercises 11.7 and 11.8.


In practice, the asymptotic result (11.15) is used to justify the covariance
matrix estimator


$$\widehat{\mathrm{Var}}(\hat\beta) = \bigl(X^\top \Upsilon(\hat\beta) X\bigr)^{-1}, \qquad (11.18)$$


in which the unknown $\beta_0$ is replaced by $\hat\beta$, and the factor of $n^{-1}$, which is
needed only for asymptotic analysis, is omitted. This approximation may be
used to obtain standard errors, t statistics, Wald statistics, and confidence
intervals that are asymptotically valid. However, they will not be exact in
finite samples.


It is clear from equations (11.15) and (11.18) that the ML estimator for the
binary response model gives some observations more weight than others. In
fact, the weight given to observation $t$ is proportional to the square root of
expression (11.16) evaluated at $\beta = \hat\beta$. It can be shown that, for both the
logit and probit models, the maximum weight will be given to observations
for which $X_t\beta = 0$, which implies that $P_t = 0.5$, while relatively little weight
will be given to observations for which $P_t$ is close to 0 or 1; see Exercise 11.9.
This makes sense, since when $P_t$ is close to 0 or 1, a given change in $X_t\beta$
will have little effect on $P_t$, while when $P_t$ is close to 0.5, such a change will
have a much larger effect. Thus we see that ML estimation, quite sensibly,
gives more weight to observations that provide more information about the
parameter values.
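The estimator (11.18) is simple to compute once the weights (11.16) are available. The sketch below does this for a probit model with a constant and one regressor, inverting the 2 x 2 matrix directly. The regressor values and the vector `beta_hat` are stand-ins chosen for illustration; in practice `beta_hat` would be the ML estimates.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def upsilon(index, F, f):
    """Typical diagonal element (11.16): f^2 / (F (1 - F))."""
    p = F(index)
    return f(index) ** 2 / (p * (1.0 - p))

def cov_matrix_2x2(beta, X, F, f):
    """Inverse of X' Upsilon(beta) X, as in (11.18), for a model with k = 2."""
    a = b = d = 0.0
    for x0, x1 in X:
        w = upsilon(beta[0] * x0 + beta[1] * x1, F, f)
        a += w * x0 * x0
        b += w * x0 * x1
        d += w * x1 * x1
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

X = [[1.0, -1.2], [1.0, -0.3], [1.0, 0.1], [1.0, 0.8], [1.0, 1.5]]
beta_hat = [0.2, 1.0]   # hypothetical stand-in for ML estimates

V = cov_matrix_2x2(beta_hat, X, norm_cdf, norm_pdf)
std_errors = [math.sqrt(V[0][0]), math.sqrt(V[1][1])]
print(std_errors)

# The weight (11.16) peaks at an index of zero and falls off in the tails.
print(upsilon(0.0, norm_cdf, norm_pdf), upsilon(2.0, norm_cdf, norm_pdf))
```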




Likelihood Ratio Tests 


It is straightforward to test restrictions on binary response models by using 
LR tests. We simply estimate both the restricted and the unrestricted model 
and calculate twice the difference between the two maximized values of the 
loglikelihood function. As usual, the LR test statistic will be asymptotically 
distributed as $\chi^2(r)$, where $r$ is the number of restrictions.


One especially simple application of this procedure can be used to test whether
the regressors in a binary response model have any explanatory power at all.
The null hypothesis is that $E(y_t \mid \Omega_t)$ is a constant, and the ML estimate of this
constant is just $\bar y$, the unconditional sample mean of the dependent variable.
It is not difficult to show that, under the null hypothesis, the loglikelihood
function (11.09) reduces to


$$n\bar y \log(\bar y) + n(1 - \bar y) \log(1 - \bar y), \qquad (11.19)$$


which is very easy to calculate. Twice the difference between the unrestricted
maximum of the loglikelihood function and the restricted maximum (11.19)
will be asymptotically distributed as $\chi^2(k - 1)$. This statistic is analogous to
the usual F test for all the slope coefficients in a linear regression model to
equal zero, and many computer programs routinely compute it.
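The LR test for the joint significance of the regressors can be sketched as follows for a logit model with a constant and one regressor. The restricted maximum comes from (11.19), and the unrestricted maximum from a small Newton routine. The data set and the helper names (`fit_logit`, `restricted_loglik`) are illustrative assumptions, and a sample of eight observations is far too small for the asymptotic distribution to be reliable.

```python
import math

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def loglik(b0, b1, x, y):
    """Logit loglikelihood (11.09) with a constant and one regressor."""
    total = 0.0
    for x_t, y_t in zip(x, y):
        p = logistic(b0 + b1 * x_t)
        total += math.log(p) if y_t == 1 else math.log(1.0 - p)
    return total

def fit_logit(x, y, tol=1e-10, max_iter=50):
    """Newton's Method for the unrestricted logit model."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x_t, y_t in zip(x, y):
            p = logistic(b0 + b1 * x_t)
            w = p * (1.0 - p)
            g0 += y_t - p
            g1 += (y_t - p) * x_t
            h00 += w
            h01 += w * x_t
            h11 += w * x_t * x_t
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def restricted_loglik(y):
    """Expression (11.19): the loglikelihood when only a constant is fitted."""
    n = len(y)
    ybar = sum(y) / n
    return n * ybar * math.log(ybar) + n * (1.0 - ybar) * math.log(1.0 - ybar)

x = [-1.2, -0.3, 0.1, 0.8, 1.5, -0.7, 0.4, 2.0]
y = [0, 1, 0, 1, 1, 0, 0, 1]

b0, b1 = fit_logit(x, y)
restricted = restricted_loglik(y)
lr_stat = 2.0 * (loglik(b0, b1, x, y) - restricted)
print(lr_stat, lr_stat > 3.84)   # compare with the .05 critical value of chi-squared(1)
```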


An Artificial Regression for Binary Choice Models 


There is a convenient artificial regression for binary response models.² Like the
Gauss-Newton regression, to which it is closely related, the binary response
model regression, or BRMR, can be used for a variety of purposes, including
parameter estimation, covariance matrix estimation, and hypothesis testing.


The most intuitive way to think of the BRMR is as a modified version of
the GNR. The ordinary GNR for the nonlinear regression model (11.11) is
(11.17). However, it is inappropriate to use this GNR, because the error
terms are heteroskedastic, with variance given by (11.13). We need to divide
the regressand and regressors of (11.17) by the square root of (11.13) in order
to obtain an artificial regression that has homoskedastic errors. The result is
the BRMR,


$$V_t^{-1/2}(\beta)\bigl(y_t - F(X_t\beta)\bigr) = V_t^{-1/2}(\beta) f(X_t\beta) X_t b + \text{residual}, \qquad (11.20)$$


where $V_t(\beta) \equiv F(X_t\beta)\bigl(1 - F(X_t\beta)\bigr)$.


If the BRMR is evaluated at the vector of ML estimates $\hat\beta$, it yields the
covariance matrix


$$s^2\bigl(X^\top \Upsilon(\hat\beta) X\bigr)^{-1}, \qquad (11.21)$$


where $s$ is the standard error of the artificial regression. Since (11.20) is a GLS
regression, $s$ will tend to 1 asymptotically, and expression (11.21) is therefore
a valid way to estimate $\mathrm{Var}(\hat\beta)$. However, because there is no advantage to
multiplying by a random variable that tends to 1, it is better simply to use
(11.18), which may readily be obtained by dividing (11.21) by $s^2$.


2 This regression was originally proposed, independently in somewhat different
forms, by Engle (1984) and Davidson and MacKinnon (1984b).


Like other artificial regressions, the BRMR can be used as part of a numerical
maximization algorithm, similar to the ones described in Section 6.4. The
formula that determines $\beta_{(j+1)}$, the value of $\beta$ at step $j + 1$, is


$$\beta_{(j+1)} = \beta_{(j)} + \alpha_{(j)} b_{(j)},$$


where $b_{(j)}$ is the vector of OLS estimates from the BRMR evaluated at $\beta_{(j)}$,
and $\alpha_{(j)}$ may be chosen in several ways. This procedure generally works very
well, but a modified Newton procedure will usually be even faster.


The BRMR is particularly useful for hypothesis testing. Suppose that $\beta$ is
partitioned as $[\beta_1 \;|\; \beta_2]$, where $\beta_1$ is a $(k - r)$-vector and $\beta_2$ is an $r$-vector. If
$\tilde\beta$ denotes the vector of ML estimates subject to the restriction that $\beta_2 = 0$,
we can test that restriction by running the BRMR


$$\tilde V_t^{-1/2}(y_t - \tilde F_t) = \tilde V_t^{-1/2} \tilde f_t X_{t1} b_1 + \tilde V_t^{-1/2} \tilde f_t X_{t2} b_2 + \text{residual}, \qquad (11.22)$$


where $\tilde F_t \equiv F(X_t\tilde\beta)$, $\tilde f_t \equiv f(X_t\tilde\beta)$, and $\tilde V_t \equiv V_t(\tilde\beta)$. Here $X_t$ has been
partitioned into two vectors, $X_{t1}$ and $X_{t2}$, corresponding to the partitioning of $\beta$.
The regressors that correspond to $\beta_1$ are orthogonal to the regressand, while
those that correspond to $\beta_2$ are not. All the usual test statistics for $b_2 = 0$ are
valid. The best test statistic to use in finite samples is probably the explained
sum of squares from regression (11.22). It will be asymptotically distributed
as $\chi^2(r)$ under the null hypothesis. An F statistic is also asymptotically valid,
but since its denominator of $s^2$ is random, and there is no need to estimate
the variance of (11.22), the explained sum of squares is preferable.


In the special case of the null hypothesis that all the slope coefficients are
zero, regression (11.22) simplifies dramatically. In this case, $X_{t1}$ is just unity,
and $\tilde V_t$, $\tilde F_t$, and $\tilde f_t$ are all constants that do not depend on $t$. Since neither
subtracting a constant from the regressand nor multiplying the regressand
and regressors by a constant has any effect on the F statistic for $b_2 = 0$,
regression (11.22) is equivalent to the much simpler regression


$$y = c_1 + X_2 c_2 + \text{residuals}. \qquad (11.23)$$


The ordinary F statistic for $c_2 = 0$ in regression (11.23) is an asymptotically
valid test statistic for the hypothesis that $\beta_2 = 0$. The fact that (11.23) is just
an OLS regression of $y$ on the constant and explanatory variables accounts
for the claim we made in Section 11.2 that such a regression is not always
completely useless!
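As a hedged illustration of this special case, the sketch below builds the BRMR (11.22) for a logit model under the null that the single slope coefficient is zero, and computes its explained sum of squares. Under that null the restricted fitted probability is the sample mean of the dependent variable, and for the logit model the density and the variance term coincide. A short calculation, not taken from the text, suggests that the explained sum of squares then equals $n$ times the centered $R^2$ from regression (11.23), and the code checks this numerically. The data and the helper `ols_fitted` are made up for the example.

```python
import math

def ols_fitted(z1, z2, r):
    """Fitted values from an OLS regression of r on two regressors z1 and z2."""
    a = sum(u * u for u in z1)
    b = sum(u * v for u, v in zip(z1, z2))
    d = sum(v * v for v in z2)
    g1 = sum(u * w for u, w in zip(z1, r))
    g2 = sum(v * w for v, w in zip(z2, r))
    det = a * d - b * b
    c1 = (d * g1 - b * g2) / det
    c2 = (a * g2 - b * g1) / det
    return [c1 * u + c2 * v for u, v in zip(z1, z2)]

x = [-1.2, -0.3, 0.1, 0.8, 1.5, -0.7, 0.4, 2.0]
y = [0, 1, 0, 1, 1, 0, 0, 1]
n = len(y)

ybar = sum(y) / n          # restricted fitted probability
v = ybar * (1.0 - ybar)    # variance term; for the logit model also the density
f = v

# Regressand and regressors of the BRMR (11.22) under the null.
regressand = [(y_t - ybar) / math.sqrt(v) for y_t in y]
z1 = [f / math.sqrt(v) for _ in x]            # corresponds to the constant
z2 = [f * x_t / math.sqrt(v) for x_t in x]    # corresponds to the slope

fitted = ols_fitted(z1, z2, regressand)
ess = sum(u * u for u in fitted)              # explained sum of squares

# Regression (11.23): OLS of y on a constant and x, and its centered R^2.
fit_y = ols_fitted([1.0] * n, x, [float(y_t) for y_t in y])
r2 = sum((u - ybar) ** 2 for u in fit_y) / sum((y_t - ybar) ** 2 for y_t in y)

print(ess, n * r2)   # the two numbers should agree
print(ess > 3.84)    # compare with the .05 critical value of chi-squared(1)
```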

Bootstrap Inference


Because binary response models are fully parametric, it is straightforward to
bootstrap them using procedures similar to those discussed in Sections 4.6
and 5.3. For the model specified by (11.01), the bootstrap DGP is required
to generate binary variables $y_t^*$, $t = 1, \ldots, n$, in such a way that


$$P_t^* \equiv E(y_t^* \mid X_t) = F(X_t\hat\beta),$$


where $\hat\beta$ is a vector of ML estimates, possibly subject to whatever restrictions
are being tested. In order to generate $y_t^*$, the easiest way to proceed is to draw
$u_t^*$ from the uniform distribution $U(0,1)$ and set $y_t^* = I(u_t^* \le P_t^*)$, where, as
usual, $I(\cdot)$ is an indicator function. Alternatively, in the case of the probit
model, we can generate bootstrap samples by using (11.04) to generate latent
variables and (11.05) to convert these to the binary dependent variables we
actually need.
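The uniform-draw recipe for the bootstrap DGP takes only a few lines. In the sketch below, the function name `bootstrap_sample`, the parameter vector, the regressors, and the seed are all illustrative assumptions, and the logistic function stands in for a generic $F$.

```python
import math
import random

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def bootstrap_sample(X, beta_hat, F, rng):
    """Draw one bootstrap sample: y*_t = 1 when a U(0,1) draw is at most F(X_t beta_hat)."""
    y_star = []
    for x_row in X:
        p_star = F(sum(b * x for b, x in zip(beta_hat, x_row)))
        y_star.append(1 if rng.random() <= p_star else 0)
    return y_star

rng = random.Random(42)
beta_hat = [0.1, 1.0]              # hypothetical ML estimates
X = [[1.0, 0.3]] * 20000           # identical rows, so every P*_t is the same
y_star = bootstrap_sample(X, beta_hat, logistic, rng)

share = sum(y_star) / len(y_star)
print(share, logistic(0.4))        # the share of ones should be near P*_t
```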


Bootstrap methods for binary response models may or may not yield more 
accurate inferences than asymptotic ones. In the case of test statistics, where 
the bootstrap samples must be generated under the null hypothesis, there 
seems to be evidence that bootstrap P values are generally more accurate 
than asymptotic ones. The value of bootstrapping appears to be particularly 
great when the number of restrictions is large and the sample size is moderate. 
However, in the case of confidence intervals, the evidence is rather mixed. 


The bootstrap can also be used to reduce the bias of the ML estimates. As 
we saw in Section 3.6, regression models tend to fit too well in finite samples, 
in the sense that the residuals tend to be smaller than the true error terms. 
Binary response models also tend to fit too well, in the sense that the fitted 
probabilities, the $F(X_t\hat\beta)$, tend to be closer to 0 and 1 than the true
probabilities, the $F(X_t\beta_0)$. This overfitting causes the elements of $\hat\beta$ to be biased
away from zero.


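If we generate $B$ bootstrap samples using the parameter vector $\hat\beta$, we can
estimate the bias using


$$\mathrm{Bias}^*(\hat\beta) = \frac{1}{B} \sum_{j=1}^{B} \hat\beta_j^* - \hat\beta,$$


where $\hat\beta_j^*$ is the estimate of $\beta$ using the $j$th bootstrap sample. Therefore, a
bias-corrected estimate is


$$\hat\beta_{\mathrm{bc}} \equiv \hat\beta - \mathrm{Bias}^*(\hat\beta) = 2\hat\beta - \frac{1}{B} \sum_{j=1}^{B} \hat\beta_j^*.$$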


Simulation results in MacKinnon and Smith (1998), which are by no means 
definitive, suggest that this estimator is less biased and has smaller mean 
squared error than the usual ML estimator. 


The finite-sample bias of the ML estimator in binary response models can
cause an important practical problem for the bootstrap. Since the probabilities
associated with $\hat\beta$ tend to be more extreme than the true ones, samples
generated using $\hat\beta$ will be more prone to having a perfect classifier. Therefore,
even though there is no perfect classifier for the original data, there may well
be perfect classifiers for some of the bootstrap samples. The simplest way to
deal with this problem is just to throw away any bootstrap samples for which
a perfect classifier exists. However, if there is more than a handful of such
samples, the bootstrap results must then be viewed with skepticism.


Specification Tests 


Maximum likelihood estimation of binary response models will almost always
yield inconsistent estimates if the form of the transformation function $F(X_t\beta)$
is misspecified. It is therefore very important to test whether this function
has been specified correctly.


In Section 11.2, we derived the probit model by starting with the latent
variable model (11.04), which has normally distributed, homoskedastic errors. A
more general specification for a latent variable model, which allows for the
error terms to be heteroskedastic, is

$$y_t^\circ = X_t\beta + u_t, \quad u_t \sim N\bigl(0, \exp(2Z_t\gamma)\bigr), \tag{11.24}$$

where $Z_t$ is a row vector of length $r$ of observations on variables that belong
to the information set $\Omega_t$, and $\gamma$ is an $r$-vector of parameters to be
estimated along with $\beta$. To ensure that both $\beta$ and $\gamma$ are identifiable, $Z_t$
must not include a constant term or the equivalent. With this precaution, the
model (11.04) is obtained by setting $\gamma = 0$. Combining (11.24) with (11.05)
yields the model

$$P_t \equiv E(y_t \mid \Omega_t) = \Phi\!\left(\frac{X_t\beta}{\exp(Z_t\gamma)}\right),$$

in which $P_t$ depends on both the regression function $X_t\beta$ and the skedastic
function $\exp(2Z_t\gamma)$. Thus it is clear that heteroskedasticity of the $u_t$ in a
latent variable model will affect the form of the transformation function.


Even when the binary response model being used is not the probit model, it 
still seems quite reasonable to consider the alternative hypothesis 


$$P_t = F\!\left(\frac{X_t\beta}{\exp(Z_t\gamma)}\right). \tag{11.25}$$


We can test against this alternative by using a BRMR to test the hypothesis
that $\gamma = 0$. The appropriate BRMR is

$$\hat V_t^{-1/2}(y_t - \hat F_t) = \hat V_t^{-1/2}\hat f_t X_t b - \hat V_t^{-1/2}\hat f_t\, X_t\hat\beta\, Z_t c + \text{residual}, \tag{11.26}$$

where $\hat F_t$, $\hat f_t$, and $\hat V_t$ are evaluated at the ML estimates $\hat\beta$ computed under the
null hypothesis that $\gamma = 0$ in (11.25). These are just the ordinary estimates
for the binary response model defined by $P_t = F(X_t\beta)$; usually they will
be probit or logit estimates. The explained sum of squares from (11.26) is
asymptotically distributed as $\chi^2(r)$ under the null hypothesis.
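
To make the construction of (11.26) concrete, the following Python sketch builds the regressand and the regressors for the probit case. It assumes, as in the BRMR, that $V_t = F_t(1 - F_t)$; the function names and the tiny data set are hypothetical, and the final OLS step that yields the explained sum of squares is deliberately left out.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def brmr_hetero_rows(y, X, Z, beta_hat):
    """Regressand and regressors of the BRMR (11.26) for a probit model.

    Each regressor row holds the k columns V^{-1/2} f X_t followed by
    the r columns -V^{-1/2} f (X_t beta_hat) Z_t.
    """
    regressand, regressors = [], []
    for y_t, x_t, z_t in zip(y, X, Z):
        index = sum(x * b for x, b in zip(x_t, beta_hat))
        F, f = norm_cdf(index), norm_pdf(index)
        w = 1.0 / math.sqrt(F * (1.0 - F))           # V_t^{-1/2}
        regressand.append(w * (y_t - F))
        row = [w * f * x for x in x_t]               # columns multiplying b
        row += [-w * f * index * z for z in z_t]     # columns multiplying c
        regressors.append(row)
    return regressand, regressors

# Hypothetical data: two observations, k = 2 regressors, r = 1 test variable.
y = [1, 0]
X = [[1.0, 0.0], [1.0, 1.0]]
Z = [[0.5], [2.0]]
beta_hat = [0.0, 0.0]                                # hypothetical restricted estimates
regressand, regressors = brmr_hetero_rows(y, X, Z, beta_hat)
```

Regressing `regressand` on the columns of `regressors` by OLS and taking the explained sum of squares would then give the test statistic described above.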


Heteroskedasticity is not the only phenomenon that may lead the
transformation function $F(X_t\beta)$ to be specified incorrectly. Consider the family of
models for which

$$P_t \equiv E(y_t \mid \Omega_t) = F\!\left(\frac{\tau(\delta X_t\beta)}{\delta}\right), \tag{11.27}$$

where $\delta$ is a scalar parameter, and $\tau(\cdot)$ may be any scalar function that is
monotonically increasing in its argument and satisfies the conditions

$$\tau(0) = 0, \quad \tau'(0) = 1, \quad\text{and}\quad \tau''(0) \ne 0, \tag{11.28}$$

where $\tau'(0)$ and $\tau''(0)$ are the first and second derivatives of $\tau(x)$, evaluated at
$x = 0$. The family of models (11.27) allows for a wide range of transformation
functions. It was considered by MacKinnon and Magee (1990), who showed,
by using l'Hôpital's Rule, that

$$\lim_{\delta\to 0}\left(\frac{\tau(\delta x)}{\delta}\right) = x \quad\text{and}\quad \lim_{\delta\to 0}\left(\frac{\partial\bigl(\tau(\delta x)/\delta\bigr)}{\partial\delta}\right) = \tfrac{1}{2}x^2\tau''(0). \tag{11.29}$$

Hence the BRMR for testing the null hypothesis that $\delta = 0$ is

$$\hat V_t^{-1/2}(y_t - \hat F_t) = \hat V_t^{-1/2}\hat f_t X_t b + \hat V_t^{-1/2}(X_t\hat\beta)^2\hat f_t\, d + \text{residual}, \tag{11.30}$$

where everything is evaluated at the ML estimates $\hat\beta$ of the ordinary binary
response model that (11.27) reduces to when $\delta = 0$. The constant factor
$\tau''(0)/2$ that arises from (11.29) is irrelevant for testing and has been omitted.
Thus regression (11.30) simply treats the squared values of the index function
evaluated at $\hat\beta$ as if they were observations on a possibly omitted regressor,
and the ordinary $t$ statistic for $d = 0$ provides an asymptotically valid test.³


Tests based on the BRMRs (11.26) and (11.30) are valid only asymptotically. 
It is extremely likely that their finite-sample performance could be improved 
by using bootstrap P values instead of asymptotic ones. Since, in both cases, 
the null hypothesis is just an ordinary binary response model, computing boot- 
strap P values by using the procedures discussed in the previous subsection 
is quite straightforward. 


³ There is a strong resemblance between regression (11.30) and the test regression
for the RESET test (Ramsey, 1969), in which squared fitted values are added
to an OLS regression as a test for functional form. As MacKinnon and Magee
(1990) showed, this resemblance is not coincidental.


11.4 Models for More than Two Discrete Responses


Discrete dependent variables that can take on three or more different values 
are by no means uncommon in economics, and a large number of models has 
been devised to deal with such cases. These are sometimes referred to as 
qualitative response models and sometimes as discrete choice models. The
binary response models we have already studied are special cases. 


Discrete choice models can be divided into two types: ones designed to deal 
with ordered responses, and ones designed to deal with unordered responses. 
Surveys often produce ordered response data. For example, respondents might 
be asked whether they strongly agree, agree, neither agree nor disagree, dis- 
agree, or strongly disagree with some statement. Here there are five possible 
responses, which evidently can be ordered in a natural way. In many other 
cases, however, there is no natural way to order the various choices. A classic 
example is the choice of transportation mode. For intercity travel, people 
often have a choice among flying, driving, taking the train, and taking the 
bus. There is no natural way to order these four choices. 


The Ordered Probit Model 


The most widely-used model for ordered response data is the ordered probit
model. This model can easily be derived from a latent variable model. The
model for the latent variable is

$$y_t^\circ = X_t\beta + u_t, \quad u_t \sim \mathrm{NID}(0, 1), \tag{11.31}$$

which is identical to the latent variable model (11.04) that led to the ordinary
probit model. As in the case of the latter, what we actually observe is a
discrete variable $y_t$ that can take on a limited, known, number of values. For
simplicity, we assume that the number of values is just 3. It will be obvious
how to extend the model to cases in which $y_t$ can take on any known number
of values.


The relation between the observed variable $y_t$ and the latent variable $y_t^\circ$ is
assumed to be given by

$$\begin{aligned}
y_t &= 0 \quad\text{if } y_t^\circ < \gamma_1;\\
y_t &= 1 \quad\text{if } \gamma_1 \le y_t^\circ < \gamma_2;\\
y_t &= 2 \quad\text{if } y_t^\circ \ge \gamma_2.
\end{aligned} \tag{11.32}$$


Thus $y_t = 0$ for small values of $y_t^\circ$, $y_t = 1$ for intermediate values, and $y_t = 2$
for large values. The boundaries between the three cases are determined by
the parameters $\gamma_1$ and $\gamma_2$. These threshold parameters, which usually must
be estimated, determine how the values of $y_t^\circ$ get translated into the three
possible values of $y_t$. It is essential that $\gamma_2 > \gamma_1$. Otherwise, the first and last
lines of (11.32) would be incompatible, and we could never observe $y_t = 1$.


If $X_t$ contains a constant term, it is impossible to identify the constant along
with $\gamma_1$ and $\gamma_2$. To see this, suppose that the constant is equal to $\alpha$. Then
it is easy to check that $y_t$ is unchanged if we replace the constant by $\alpha + \delta$
and replace $\gamma_i$ by $\gamma_i + \delta$ for $i = 1, 2$. The easiest, but not the only, solution to
this identification problem is just to set $\alpha = 0$. We adopt this solution here.
In general, with no constant, the ordered probit model will have one fewer
threshold parameter than the number of choices. When there are just two
choices, the single threshold parameter is equivalent to a constant, and the
ordered probit model reduces to the ordinary probit model, with a constant.


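In order to work out the loglikelihood function for this model, we need the
probabilities of the three events $y_t = 0$, $y_t = 1$, and $y_t = 2$. The probability
that $y_t = 0$ is

$$\begin{aligned}
\Pr(y_t = 0) &= \Pr(y_t^\circ < \gamma_1) = \Pr(X_t\beta + u_t < \gamma_1)\\
&= \Pr(u_t < \gamma_1 - X_t\beta) = \Phi(\gamma_1 - X_t\beta).
\end{aligned}$$

Similarly, the probability that $y_t = 2$ is

$$\begin{aligned}
\Pr(y_t = 2) &= \Pr(y_t^\circ \ge \gamma_2) = \Pr(X_t\beta + u_t \ge \gamma_2)\\
&= \Pr(u_t \ge \gamma_2 - X_t\beta) = \Phi(X_t\beta - \gamma_2).
\end{aligned}$$

Finally, the probability that $y_t = 1$ is

$$\begin{aligned}
\Pr(y_t = 1) &= 1 - \Pr(y_t = 0) - \Pr(y_t = 2)\\
&= 1 - \Phi(\gamma_1 - X_t\beta) - \Phi(X_t\beta - \gamma_2)\\
&= \Phi(\gamma_2 - X_t\beta) - \Phi(\gamma_1 - X_t\beta).
\end{aligned}$$


These probabilities depend solely on the value of the index function, $X_t\beta$,
and on the two threshold parameters.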
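
The three probabilities just derived can be computed as follows. This is a minimal sketch; the function names and the numerical values of the index and thresholds are illustrative assumptions.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(index, gamma1, gamma2):
    """Return (Pr(y=0), Pr(y=1), Pr(y=2)) given the index X_t beta and thresholds."""
    if gamma2 <= gamma1:
        raise ValueError("thresholds must satisfy gamma2 > gamma1")
    p0 = norm_cdf(gamma1 - index)
    p2 = norm_cdf(index - gamma2)
    p1 = norm_cdf(gamma2 - index) - norm_cdf(gamma1 - index)
    return p0, p1, p2

# Hypothetical values of the index function and the two thresholds.
probs = ordered_probit_probs(index=0.3, gamma1=-0.5, gamma2=1.0)
```

Because the middle probability is the difference of two CDF values, it is positive only when the second threshold exceeds the first, which is why the function rejects any other ordering.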


The loglikelihood function for the ordered probit model derived from (11.31)
and (11.32) is

$$\begin{aligned}
\ell(\beta, \gamma_1, \gamma_2) = {}& \sum_{y_t = 0}\log\bigl(\Phi(\gamma_1 - X_t\beta)\bigr) + \sum_{y_t = 2}\log\bigl(\Phi(X_t\beta - \gamma_2)\bigr)\\
&+ \sum_{y_t = 1}\log\bigl(\Phi(\gamma_2 - X_t\beta) - \Phi(\gamma_1 - X_t\beta)\bigr).
\end{aligned} \tag{11.33}$$


Maximizing (11.33) numerically is generally not difficult to do, although steps
may have to be taken to ensure that $\gamma_2$ is always greater than $\gamma_1$. Note that
the function $\Phi$ in (11.33) may be replaced by any function $F$ that satisfies the
conditions (11.02), although it may then be harder to derive the probabilities
from a latent variable model. Thus the ordered probit model is by no means
the only qualitative response model for ordered data.
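
One common way to keep the second threshold above the first during numerical maximization is to reparametrize it as $\gamma_2 = \gamma_1 + \exp(\delta)$ and let the optimizer work with $\delta$. The text does not prescribe this device; it is simply one option. The sketch below evaluates (11.33) under that assumed parametrization, with a hypothetical function name and hypothetical data.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_loglik(params, y, X):
    """Loglikelihood (11.33); params = beta (k elements), then gamma1, then delta.

    The second threshold is gamma2 = gamma1 + exp(delta), so gamma2 > gamma1
    holds for every real value of delta.
    """
    k = len(X[0])
    beta, gamma1, delta = params[:k], params[k], params[k + 1]
    gamma2 = gamma1 + math.exp(delta)
    loglik = 0.0
    for y_t, x_t in zip(y, X):
        index = sum(x * b for x, b in zip(x_t, beta))
        if y_t == 0:
            p = norm_cdf(gamma1 - index)
        elif y_t == 2:
            p = norm_cdf(index - gamma2)
        else:
            p = norm_cdf(gamma2 - index) - norm_cdf(gamma1 - index)
        loglik += math.log(p)
    return loglik

# Hypothetical data: one regressor, no constant (see the identification argument).
y = [0, 1, 2]
X = [[0.0], [0.0], [0.0]]
value = ordered_probit_loglik([0.0, -1.0, math.log(2.0)], y, X)
```

Any general-purpose optimizer could then be applied to the negative of this function without imposing an explicit inequality constraint.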


The ordered probit model is widely used in applied econometric work. A
simple, graphical exposition of this model is provided by Becker and Kennedy 
(1992). Like the ordinary probit model, the ordered probit model can be 
generalized in a number of ways; see, for example, Terza (1985). An interesting 
application of a generalized version, which allows for heteroskedasticity, is 
Hausman, Lo, and MacKinlay (1992). They apply the model to price changes 
on the New York Stock Exchange at the level of individual trades. Because 
the price change from one trade to the next almost always takes on one of a 
small number of possible values, an ordered probit model is an appropriate 
way to model these changes. 


The Multinomial Logit Model 


The key feature of ordered qualitative response models like the ordered probit 
model is that all the choices depend on a single index function. This makes 
sense only when the responses have a natural ordering. A different sort of 
model is evidently necessary to deal with unordered responses. The most 
popular of these is the multinomial logit model, sometimes called the multiple 
logit model, which has been widely used in applied work. 


The multinomial logit model is designed to handle $J + 1$ responses, for $J \ge 1$.
According to this model, the probability that any one of them is observed is

$$\Pr(y_t = l) = \frac{\exp(W_{tl}\beta^l)}{\sum_{j=0}^{J}\exp(W_{tj}\beta^j)} \quad\text{for } l = 0, \ldots, J. \tag{11.34}$$

Here $W_{tj}$ is a row vector of length $k_j$ of observations on variables that belong
to the information set of interest, and $\beta^j$ is a $k_j$-vector of parameters, usually
different for each $j = 0, \ldots, J$.


Estimation of the multinomial logit model is reasonably straightforward. The
loglikelihood function can be written as

$$\sum_{t=1}^{n}\left(\sum_{j=0}^{J} I(y_t = j)\,W_{tj}\beta^j - \log\Bigl(\sum_{j=0}^{J}\exp(W_{tj}\beta^j)\Bigr)\right), \tag{11.35}$$

where $I(\cdot)$ is the indicator function. Thus each observation contributes two
terms to the loglikelihood function. The first is $W_{tj}\beta^j$, where $y_t = j$, and the
second is minus the logarithm of the denominator that appears in (11.34). It
is generally not difficult to maximize (11.35) by using some sort of modified
Newton method, provided there are no perfect classifiers, since the
loglikelihood function (11.35) is globally concave with respect to the entire vector of
parameters, $[\beta^0 \,\vdots\, \cdots \,\vdots\, \beta^J]$; see Exercise 11.16.
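
A direct evaluation of (11.35) might look like the following sketch. The function name and the data layout (a nested list `W[t][j]` holding the row vector for observation $t$ and outcome $j$) are assumptions made for illustration. The log of the denominator is computed after subtracting the largest index, a standard device for avoiding overflow that does not change the value.

```python
import math

def multinomial_logit_loglik(y, W, betas):
    """Loglikelihood (11.35).

    y[t] is the observed outcome in 0..J, W[t][j] is the row vector W_tj,
    and betas[j] is the parameter vector beta^j.
    """
    loglik = 0.0
    for y_t, W_t in zip(y, W):
        indices = [sum(w * b for w, b in zip(W_tj, beta_j))
                   for W_tj, beta_j in zip(W_t, betas)]
        m = max(indices)
        # log of the denominator of (11.34), computed stably
        log_denom = m + math.log(sum(math.exp(v - m) for v in indices))
        loglik += indices[y_t] - log_denom
    return loglik

# Hypothetical data: n = 2 observations, J + 1 = 3 outcomes, one variable each.
y = [0, 2]
W = [[[1.0], [0.5], [2.0]], [[0.3], [1.5], [0.7]]]
betas = [[0.0], [0.0], [0.0]]
value = multinomial_logit_loglik(y, W, betas)
```

With every parameter equal to zero, each outcome has probability one third, so the loglikelihood for two observations is $-2\log 3$.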


Some special cases of the multinomial logit model are of interest. One of these
arises when the explanatory variables $W_{tj}$ are the same for each choice $j$. If
a model is intended to explain which of an unordered set of outcomes applies
to the different individuals in a sample, then the probabilities of all of these
outcomes can be expected to depend on the same set of characteristics for 
each individual. For instance, a student wondering how to spend Saturday 
night may be able to choose among studying, partying, visiting parents, or 
going to the movies. In choosing, the student takes into account things like 
grades on the previous midterm, the length of time since the last visit home, 
the interest of what is being shown at the local movie theater, and so on. All 
these variables affect the probability of each possible outcome. 


For models of this sort, it is not possible to identify $J + 1$ parameter vectors $\beta^j$,
$j = 0, \ldots, J$. To see this, let $X_t$ denote the common set of explanatory
variables for observation $t$, and define $\gamma^j \equiv \beta^j - \beta^0$ for $j = 1, \ldots, J$. On
replacing the $W_{tj}$ by $X_t$ for all $j$, the probabilities defined in (11.34) become,
for $l = 1, \ldots, J$,

$$\Pr(y_t = l) = \frac{\exp(X_t\beta^l)}{\sum_{j=0}^{J}\exp(X_t\beta^j)} = \frac{\exp(X_t\gamma^l)}{1 + \sum_{j=1}^{J}\exp(X_t\gamma^j)},$$

where the second equality is obtained by dividing both the numerator and the
denominator by $\exp(X_t\beta^0)$. For outcome 0, the probability is just

$$\Pr(y_t = 0) = \frac{1}{1 + \sum_{j=1}^{J}\exp(X_t\gamma^j)}.$$

It follows that all $J + 1$ probabilities can be expressed in terms of the
parameters $\gamma^j$, $j = 1, \ldots, J$, independently of $\beta^0$. In practice, it is easiest to
impose the restriction that $\beta^0 = 0$, which is then enough to identify the
parameters $\beta^j$, $j = 1, \ldots, J$. When $J = 1$, it is easy to see that this model reduces
to the ordinary logit model with a single index function $X_t\beta^1$.
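
The identification argument can be checked numerically: subtracting the parameter vector of outcome 0 from every parameter vector leaves all the probabilities unchanged. The sketch below uses a hypothetical function name and made-up parameter values.

```python
import math

def mnl_probs_common_x(x_t, betas):
    """Probabilities (11.34) when every outcome shares the regressors x_t."""
    indices = [sum(x * b for x, b in zip(x_t, beta_j)) for beta_j in betas]
    m = max(indices)
    expv = [math.exp(v - m) for v in indices]
    total = sum(expv)
    return [e / total for e in expv]

x_t = [1.0, 2.5]                                   # hypothetical regressors
betas = [[0.3, -0.2], [1.0, 0.4], [-0.5, 0.1]]     # hypothetical beta^0, beta^1, beta^2
# gamma^j = beta^j - beta^0, so that gamma^0 is the zero vector
gammas = [[b - b0 for b, b0 in zip(beta_j, betas[0])] for beta_j in betas]

p_original = mnl_probs_common_x(x_t, betas)
p_normalized = mnl_probs_common_x(x_t, gammas)
```

The two probability vectors agree up to rounding error, which is why the data cannot distinguish between the two sets of parameters.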


In certain cases, some but not all of the explanatory variables are common to
all outcomes. In that event, for the common variables, a separate parameter
cannot be identified for each outcome, for the same reason as above. In order
to set up a model for which all the parameters are identified, it is necessary to
set to zero those components of $\beta^0$ that correspond to the common variables.
Thus, for instance, at most $J$ of the $W_{tj}$ vectors can include a constant.


Another special case of interest is the so-called conditional logit model. For
this model, the probability that agent $t$ makes choice $l$ is

$$\Pr(y_t = l) = \frac{\exp(W_{tl}\beta)}{\sum_{j=0}^{J}\exp(W_{tj}\beta)}, \tag{11.36}$$

where $W_{tj}$ is a row vector with $k$ components for each $j = 0, \ldots, J$, and $\beta$ is a
$k$-vector of parameters, the same for each $j$. This model has been extensively
used to model the choice among competing modes of transportation. The
usual interpretation is that the elements of $W_{tj}$ are the characteristics of
choice $j$ for agent $t$, and agents make their choice by considering the weighted
sums $W_{tj}\beta$ of these characteristics.


It is necessary that none of the explanatory variables in the $W_{tj}$ vectors should
be the same for all $J + 1$ choices. In other words, no single variable should
appear in each and every $W_{tj}$. It is easy to see from (11.36) that, if there
were such a variable, say $w_{ti}$ for some $i = 1, \ldots, k$, then this variable would
be multiplied by the same parameter $\beta_i$ for each choice. In consequence,
the factor $\exp(w_{ti}\beta_i)$ would appear in the numerator and in every term of the
denominator of (11.36) and could be cancelled out. This implies, in particular,
that none of the explanatory variables can be constant for all $t = 1, \ldots, n$ and
all $j = 0, \ldots, J$.


An important property of the general multinomial logit model defined by the
set of probabilities (11.34) is that

$$\frac{\Pr(y_t = l)}{\Pr(y_t = j)} = \frac{\exp(W_{tl}\beta^l)}{\exp(W_{tj}\beta^j)},$$

for any two responses $l$ and $j$. Therefore, the ratio of the probabilities of any
two responses depends solely on the explanatory variables $W_{tl}$ and $W_{tj}$ and
the parameters $\beta^l$ and $\beta^j$ associated with those two responses. It does not
depend on the explanatory variables or parameter vectors specific to any of
the other responses. This property of the model is called the independence of
irrelevant alternatives, or IIA, property.


The IIA property is often quite implausible. For example, suppose there are 
three modes of public transportation between a pair of cities: the bus, which 
is slow but cheap, the airplane, which is fast but expensive, and the train, 
which is a little faster than the bus and a lot cheaper than the airplane. 
Now consider what the model says will happen if the rail line is upgraded, 
causing the train to become much faster but considerably more expensive. 
Intuitively, we might expect a lot of people who previously flew to take the 
train instead, but relatively few to switch from the bus to the train. However, 
this is not what the model says. Instead, the IIA property implies that the 
ratio of travelers who fly to travelers who take the bus is the same whatever 
the characteristics of the train. 
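
The transportation example can be reproduced numerically. The sketch below uses the conditional logit form (11.36), a special case of the multinomial logit model, with two characteristics per mode (travel time in hours and fare). The parameter values and the characteristics of each mode are invented for illustration only.

```python
import math

def conditional_logit_probs(W_t, beta):
    """Probabilities (11.36); W_t[j] holds the characteristics of choice j."""
    indices = [sum(w * b for w, b in zip(W_tj, beta)) for W_tj in W_t]
    m = max(indices)
    expv = [math.exp(v - m) for v in indices]
    total = sum(expv)
    return [e / total for e in expv]

beta = [-0.5, -0.02]          # hypothetical weights on travel time and fare
# Rows are bus, airplane, train; columns are (hours, fare). All values made up.
before = [[8.0, 40.0], [1.5, 300.0], [7.0, 60.0]]
after = [[8.0, 40.0], [1.5, 300.0], [3.0, 150.0]]   # faster, dearer train

p_before = conditional_logit_probs(before, beta)
p_after = conditional_logit_probs(after, beta)
ratio_before = p_before[1] / p_before[0]             # airplane relative to bus
ratio_after = p_after[1] / p_after[0]
```

With these made-up numbers the train gains market share after the upgrade, yet the ratio of air travelers to bus travelers is the same before and after, exactly as the IIA property requires.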

Although the IIA property is often not a plausible one, it can easily be tested;
see Hausman and McFadden (1984), McFadden (1987), and Exercise 11.22. 
The simplicity of the multinomial logit model, despite the IIA property, makes 
this model very attractive for cases in which it does not appear to be incom- 
patible with the data. 


The Nested Logit Model 


A discrete choice model that does not possess the IIA property is the nested
logit model. For this model, the set of possible choices is decomposed into
subsets. Let the set of outcomes $\{0, 1, \ldots, J\}$ be partitioned into $m$ disjoint
subsets $A_i$, $i = 1, \ldots, m$. The model then supposes that, conditional on
choosing an outcome in subset $A_i$, the choice among the members of $A_i$ is
governed by a standard multinomial logit model. We have, for $j \in A_i$, that

$$\Pr(y_t = j \mid y_t \in A_i) = \frac{\exp(W_{tj}\beta^j/\theta_i)}{\sum_{l \in A_i}\exp(W_{tl}\beta^l/\theta_i)}. \tag{11.37}$$

It is clear that the parameter $\theta_i$, which can be thought of as a scale
parameter for the parameter vectors $\beta^j$, $j \in A_i$, is not identifiable on the basis of
choice within the elements of subset $A_i$. However, it is what determines the
probability of choosing some element in $A_i$. Specifically, we assume that

$$\Pr(y_t \in A_i) = \frac{\exp(\theta_i h_{ti})}{\sum_{k=1}^{m}\exp(\theta_k h_{tk})}, \tag{11.38}$$

where we have defined the inclusive value of subset $A_i$ as:

$$h_{ti} = \log\Bigl(\sum_{j \in A_i}\exp(W_{tj}\beta^j/\theta_i)\Bigr). \tag{11.39}$$

Since it follows at once from (11.38) that $\sum_{i=1}^{m}\Pr(y_t \in A_i) = 1$, we can see
that $y_t$ must belong to one of the disjoint sets $A_i$.


By putting together (11.37) and (11.38), we obtain the $J + 1$ probabilities
for the different outcomes. For each $j = 0, \ldots, J$, let $i(j)$ be the subset
containing $j$. In other words, $j \in A_{i(j)}$. Then we have that

$$\begin{aligned}
\Pr(y_t = j) &= \Pr(y_t = j \mid y_t \in A_{i(j)})\,\Pr(y_t \in A_{i(j)})\\
&= \frac{\exp(W_{tj}\beta^j/\theta_{i(j)})}{\sum_{l \in A_{i(j)}}\exp(W_{tl}\beta^l/\theta_{i(j)})}\;\frac{\exp(\theta_{i(j)}h_{t\,i(j)})}{\sum_{k=1}^{m}\exp(\theta_k h_{tk})}.
\end{aligned} \tag{11.40}$$

It is not hard to check that, if $\theta_i = 1$ for all $i = 1, \ldots, m$, the
probabilities (11.40) reduce to the probabilities (11.34) of the usual multinomial logit
model; see Exercise 11.17. Thus the multinomial logit model is contained
within the nested logit model as a special case. It follows, therefore, that
testing the multinomial logit model against the alternative of the nested logit
model, for some appropriate choice of the subsets $A_i$, is one way to test
whether the IIA property is compatible with the data.
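
The following sketch computes the probabilities (11.40) from the index values $W_{tj}\beta^j$ of a single observation and confirms numerically that they collapse to the multinomial logit probabilities when every scale parameter equals 1. The function name, the partition into two nests, and all numerical values are hypothetical, and no protection against overflow is attempted.

```python
import math

def nested_logit_probs(indices, nests, thetas):
    """Probabilities (11.40) for one observation.

    indices[j] is W_tj beta^j, nests is a list of lists of outcome labels
    (the subsets A_i), and thetas[i] is the scale parameter of nest i.
    """
    # Inclusive values h_ti, equation (11.39)
    h = [math.log(sum(math.exp(indices[j] / th) for j in nest))
         for nest, th in zip(nests, thetas)]
    denom = sum(math.exp(th * h_i) for th, h_i in zip(thetas, h))
    probs = [0.0] * len(indices)
    for nest, th, h_i in zip(nests, thetas, h):
        p_nest = math.exp(th * h_i) / denom                       # (11.38)
        for j in nest:
            p_within = math.exp(indices[j] / th) / math.exp(h_i)  # (11.37)
            probs[j] = p_within * p_nest
    return probs

indices = [0.2, -0.4, 1.0, 0.5]        # hypothetical index values, J + 1 = 4
nests = [[0, 1], [2, 3]]               # hypothetical partition into m = 2 subsets
p_nested = nested_logit_probs(indices, nests, thetas=[0.5, 0.8])
p_unit = nested_logit_probs(indices, nests, thetas=[1.0, 1.0])
total = sum(math.exp(v) for v in indices)
p_mnl = [math.exp(v) / total for v in indices]
```

Comparing `p_unit` with `p_mnl` element by element gives agreement up to rounding error, which is the special case referred to above.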


An Artificial Regression for Discrete Choice Models 


In order to perform the test of the IIA property mentioned just above, and
to perform inference generally in the context of discrete choice models, it is
convenient to be able to make use of an artificial regression. The simplest
such artificial regression was proposed by McFadden (1987) for multinomial
logit models. In this section, we present a generalized version that can be
applied to any discrete choice model. We call this the discrete choice artificial
regression, or DCAR.


As usual, we assume that there are $J + 1$ possible outcomes, numbered from
$j = 0$ to $j = J$. Let the probability of choosing outcome $j$ for observation $t$ be
given by the function $\Pi_{tj}(\theta)$, where $\theta$ is a $k$-vector of parameters. For the
multinomial logit model, $\theta$ would include all of the independent parameters in
the set of parameter vectors $\beta^j$, $j = 0, \ldots, J$. The function $\Pi_{tj}(\cdot)$ will usually
also depend on exogenous or predetermined explanatory variables that are
not made explicit in the notation. We require that $\sum_{j=0}^{J}\Pi_{tj}(\theta) = 1$ for all
$t = 1, \ldots, n$ and for all admissible parameter vectors $\theta$, in order that the set
of $J + 1$ outcomes should be exhaustive.


For each observation $t$, $t = 1, \ldots, n$, define the $J + 1$ indicator variables $d_{tj}$ as
$d_{tj} = I(y_t = j)$. Then the loglikelihood function of the discrete choice model
is given by

$$\ell(y, \theta) = \sum_{t=1}^{n}\sum_{j=0}^{J} d_{tj}\log\Pi_{tj}(\theta). \tag{11.41}$$

Just as for the loglikelihood functions (11.09) and (11.35), the contribution
made by observation $t$ is the logarithm of the probability that $y_t$ should have
taken on its observed value.


The DCAR has $n(J + 1)$ “observations,” $J + 1$ for each real observation. For
observation $t$, the $J + 1$ components of the regressand, evaluated at $\theta$, are given
by $\Pi_{tj}^{-1/2}(\theta)\bigl(d_{tj} - \Pi_{tj}(\theta)\bigr)$, $j = 0, \ldots, J$. The components of the regressor
corresponding to parameter $\theta_i$, $i = 1, \ldots, k$, are given by $\Pi_{tj}^{-1/2}(\theta)\,\partial\Pi_{tj}(\theta)/\partial\theta_i$.
Thus the DCAR may be written as

$$\Pi_{tj}^{-1/2}(\theta)\bigl(d_{tj} - \Pi_{tj}(\theta)\bigr) = \Pi_{tj}^{-1/2}(\theta)\,T_{tj}(\theta)\,b + \text{residual}, \tag{11.42}$$

for $t = 1, \ldots, n$ and $j = 0, \ldots, J$. Here $T_{tj}(\theta)$ denotes the $1 \times k$ vector of
the partial derivatives of $\Pi_{tj}(\theta)$ with respect to the components of $\theta$, and, as
usual, $b$ is a $k$-vector of artificial parameters. It is easy to see that the scalar
product of the regressand and the regressor corresponding to $\theta_i$ is

$$\sum_{t=1}^{n}\sum_{j=0}^{J}\frac{\bigl(d_{tj} - \Pi_{tj}(\theta)\bigr)\,\partial\Pi_{tj}(\theta)/\partial\theta_i}{\Pi_{tj}(\theta)}. \tag{11.43}$$

The derivative of the loglikelihood function (11.41) with respect to $\theta_i$ is

$$\frac{\partial\ell(y, \theta)}{\partial\theta_i} = \sum_{t=1}^{n}\sum_{j=0}^{J} d_{tj}\,\frac{\partial\Pi_{tj}(\theta)/\partial\theta_i}{\Pi_{tj}(\theta)},$$

and we can see that this is equal to (11.43), because differentiating the identity
$\sum_{j=0}^{J}\Pi_{tj}(\theta) = 1$ with respect to $\theta_i$ shows that $\sum_{j=0}^{J}\partial\Pi_{tj}(\theta)/\partial\theta_i = 0$. It
follows that the regressand is orthogonal to all the regressors when all the
artificial variables are evaluated at the maximum likelihood estimates $\hat\theta$.
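
The equality between (11.43) and the gradient of the loglikelihood can be verified numerically. The sketch below does so for a multinomial logit model with regressors common to all outcomes and the normalization $\beta^0 = 0$, for which the derivative of the probability of outcome $j$ with respect to $\beta^l$ is $\Pi_{tj}\bigl(I(j = l) - \Pi_{tl}\bigr)X_t$. The function names, the data, and the parameter values are hypothetical.

```python
import math

def mnl_probs(x_t, theta, J):
    """Multinomial logit probabilities with beta^0 = 0; theta stacks beta^1..beta^J."""
    k = len(x_t)
    indices = [0.0] + [sum(x_t[m] * theta[(l - 1) * k + m] for m in range(k))
                       for l in range(1, J + 1)]
    mx = max(indices)
    e = [math.exp(v - mx) for v in indices]
    s = sum(e)
    return [v / s for v in e]

def dcar(y, X, theta, J):
    """Regressand and regressors of the DCAR (11.42) for this model."""
    regressand, regressors = [], []
    for y_t, x_t in zip(y, X):
        k = len(x_t)
        P = mnl_probs(x_t, theta, J)
        for j in range(J + 1):
            d_tj = 1.0 if y_t == j else 0.0
            # T_tj: derivatives of Pi_tj with respect to each element of theta
            T = [P[j] * ((1.0 if j == l else 0.0) - P[l]) * x_t[m]
                 for l in range(1, J + 1) for m in range(k)]
            root = math.sqrt(P[j])
            regressand.append((d_tj - P[j]) / root)
            regressors.append([v / root for v in T])
    return regressand, regressors

def score(y, X, theta, J):
    """Gradient of the loglikelihood (11.41) for the same model."""
    g = [0.0] * len(theta)
    for y_t, x_t in zip(y, X):
        k = len(x_t)
        P = mnl_probs(x_t, theta, J)
        for l in range(1, J + 1):
            for m in range(k):
                g[(l - 1) * k + m] += ((1.0 if y_t == l else 0.0) - P[l]) * x_t[m]
    return g

y = [0, 1, 2, 1]                                     # hypothetical outcomes
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]]
theta = [0.1, -0.2, 0.3, 0.4]                        # hypothetical beta^1, beta^2
regressand, regressors = dcar(y, X, theta, J=2)
products = [sum(r * row[i] for r, row in zip(regressand, regressors))
            for i in range(len(theta))]
gradient = score(y, X, theta, J=2)
```

The entries of `products` match those of `gradient` up to rounding error at any parameter vector, so both vanish together at the ML estimates.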


In Exercises 11.18 and 11.19, readers are asked to show that regression (11.42), 
the DCAR, satisfies the other requirements for an artificial regression used 
for hypothesis testing, as set out in Exercise 8.20. See also Exercise 11.22, in 
which readers are asked to implement by artificial regression the test of the 
IIA property discussed at the end of the previous subsection. 


As with binary response models, it is easy to bootstrap discrete choice models,
because they are fully parametrically specified. For the model characterized
by the loglikelihood function (11.41), an easy way to implement the bootstrap
DGP is, first, to construct the cumulative probabilities $P_{tj}(\hat\theta) \equiv \sum_{i=0}^{j}\Pi_{ti}(\hat\theta)$,
for $j = 0, \ldots, J - 1$, and then to draw a random number, $u_t^*$ say for
observation $t$, from the uniform distribution U(0,1). The bootstrap dependent
variable $y_t^*$ is then set equal to

$$y_t^* = \sum_{j=0}^{J-1} I\bigl(u_t^* \ge P_{tj}(\hat\theta)\bigr).$$

All of the indicator functions in the above sum are zero if $u_t^* < P_{t0}(\hat\theta) =
\Pi_{t0}(\hat\theta)$, an event which occurs with probability $\Pi_{t0}(\hat\theta)$, as desired. Similarly,
$y_t^* = j$ for $j = 1, \ldots, J$ if and only if $P_{t(j-1)}(\hat\theta) \le u_t^* < P_{tj}(\hat\theta)$, an event that
occurs with probability $\Pi_{tj}(\hat\theta) = P_{tj}(\hat\theta) - P_{t(j-1)}(\hat\theta)$.
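
This bootstrap DGP amounts to counting how many of the cumulative probabilities the uniform draw equals or exceeds. A minimal sketch, with hypothetical function names and made-up fitted probabilities, is:

```python
import random
from itertools import accumulate

def draw_outcome(probs, u_star):
    """Map a U(0,1) draw to an outcome in 0..J, given probs = [Pi_t0, ..., Pi_tJ]."""
    cumulative = list(accumulate(probs))[:-1]      # P_t0, ..., P_t(J-1)
    # The outcome is the number of cumulative probabilities that u* reaches.
    return sum(1 for P_tj in cumulative if u_star >= P_tj)

def bootstrap_discrete_sample(prob_rows, rng):
    """Draw one bootstrap sample, one outcome per row of fitted probabilities."""
    return [draw_outcome(probs, rng.random()) for probs in prob_rows]

rng = random.Random(2024)                          # seeded for reproducibility
prob_rows = [[0.2, 0.5, 0.3]] * 10000              # hypothetical fitted probabilities
y_star = bootstrap_discrete_sample(prob_rows, rng)
shares = [y_star.count(j) / len(y_star) for j in range(3)]
```

With this many draws, the observed shares of the three outcomes should lie close to the probabilities used to generate them.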


The Multinomial Probit Model 


Another discrete choice model that can sometimes be used when the IIA
property is unacceptable is the multinomial probit model. This model is
theoretically attractive but computationally burdensome. The $J + 1$ possible
outcomes are generated by the latent variable model

$$y_{tj}^\circ = W_{tj}\beta^j + u_{tj}, \quad u_t \sim N(0, \Omega), \tag{11.44}$$

where the $y_{tj}^\circ$ are not observed, and $u_t$ is a $1 \times (J + 1)$ vector with typical
element $u_{tj}$. What we observe are the binary variables $y_{tj}$, which are assumed
to be determined as follows:

$$\begin{aligned}
y_{tj} &= 1 \quad\text{if } y_{tj}^\circ - y_{ti}^\circ \ge 0 \text{ for all } i = 0, \ldots, J,\\
y_{tj} &= 0 \quad\text{otherwise.}
\end{aligned} \tag{11.45}$$


As with the multinomial logit model, separate coefficients cannot be identified
for all $J + 1$ outcomes if an explanatory variable is common to all of the index
functions $W_{tj}\beta^j$. The solution to this problem is the same as before: We set
the components of $\beta^0$ equal to 0 for all such variables.


It is clear from (11.45) that the observed $y_{tj}$ depend only on the differences
$y_{tj}^\circ - y_{t0}^\circ$, $j = 1, \ldots, J$. Let $z_{tj}^\circ$ be equal to this difference. Then


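$$\begin{aligned}
y_{tj} &= 1 \quad\text{if } z_{tj}^\circ \ge z_{ti}^\circ \text{ for all } i = 0, \ldots, J, \text{ where } z_{t0}^\circ \equiv 0;\\
y_{tj} &= 0 \quad\text{otherwise.}
\end{aligned} \tag{11.46}$$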


Thus the probabilities $\Pr(y_{tj} = 1)$ are completely determined by the joint
distribution of the $z_{tj}^\circ$. We write the covariance matrix of this distribution as $\Sigma$,
where $\Sigma$ is a $J \times J$ symmetric positive definite matrix, uniquely determined
by the $(J + 1) \times (J + 1)$ matrix $\Omega$ of (11.44), although $\Omega$ is not uniquely
determined by $\Sigma$. It follows that the matrix $\Omega$ cannot be identified on the
basis of the observed variables $y_{tj}$ alone.


In fact, even $\Sigma$ is identified only up to scale. This can be seen by observing
that, if all the $z_{tj}^\circ$ in (11.46) are multiplied by the same positive constant, the
values of the $y_{tj}$ remain unchanged. In practice, it is customary to set the first
diagonal element of $\Sigma$ equal to 1 in order to set the scale of $\Sigma$. Once the scale
is fixed, then the only other restriction on $\Sigma$ is that it must be symmetric
and positive definite. In particular, it may well have nonzero off-diagonal
elements, and these give the multinomial probit model a flexibility that is
not shared by the multinomial logit model. In consequence, the multinomial
probit model does not have the IIA property.


The latent variable model (11.44) can be interpreted as a model determining
the utility levels yielded by the different outcomes. Then the correlation
between $z_{ti}^\circ$ and $z_{tj}^\circ$, for $i \ne j$, might measure the extent to which a preference
for flying over driving, say, is correlated with a preference for taking the train
over driving. In this example of transportation mode choice, we are assuming
that driving is outcome 0. It seems fair to say that, although these correlations
are what provides multinomial probit with greater flexibility than multinomial
logit, they are a little difficult to interpret directly.


Unfortunately, the multinomial probit model is not at all easy to estimate. 
The event $y_{tj} = 1$ will be observed if and only if $y_{tj}^\circ - y_{ti}^\circ \ge 0$ for all
$i = 0, \ldots, J$, and the probability of this event is given by a $J$-dimensional
integral. In order to evaluate the loglikelihood function just once, the inte- 
gral corresponding to whatever event occurred must be computed for every 
observation in the sample. This must generally be done a large number of 
times during the course of whatever nonlinear optimization procedure is used. 
Evaluating high-dimensional integrals of the normal distribution is analyti- 
cally intractable. Therefore, except when J is very small, the multinomial 
probit model is usually estimated by simulation-based methods, including the 
method of simulated moments, which was discussed in Section 9.6. See Haji- 
vassiliou and Ruud (1994) and Gouriéroux and Monfort (1996) for discussions 
of some of the methods that have been proposed. 


The treatment of qualitative response models in this section has necessarily
been incomplete. Detailed surveys of the older literature include Amemiya
(1985, Chapter 9) and McFadden (1984). For a more up-to-date survey, but
one that is relatively superficial, see Maddala and Flores-Lagunes (2001).


Copyright © 1999, Russell Davidson and James G. MacKinnon


11.5 Models for Count Data 


Many economic variables are nonnegative integers. Examples include the 
number of patents granted to a firm and the number of visits to the hospital 
by an individual, where each is measured over some period of time. Data of 
this type are called event count data or, simply, count data. In many cases, 
the count is 0 for a substantial fraction of the observations. 


One might think of using an ordered discrete choice model like the ordered 
probit model to handle data of this type. However, this is usually not ap- 
propriate, because such a model requires the number of possible outcomes to 
be fixed and known. Instead, we need a model for which any nonnegative 
integer value is a valid, although perhaps very unlikely, value. One way to 
obtain such a model is to start from a distribution which has this property. 
The most popular distribution of this type is the Poisson distribution. If a
discrete random variable $Y$ follows the Poisson distribution, then

$$\Pr(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots. \tag{11.47}$$

This distribution is characterized by a single parameter, $\lambda$. It can be shown
that the probabilities (11.47) sum to 1 over $y = 0, 1, 2, \ldots$, and that the mean
and the variance of a Poisson random variable are both equal to $\lambda$, which
must therefore take on only positive values; see Exercise 11.23.


The Poisson Regression Model 


The simplest model for count data is the Poisson regression model, which is
obtained by replacing the parameter $\lambda$ in (11.47) by a nonnegative function
of regressors and parameters. The most popular choice for this function is the
exponential mean function

$$\lambda_t(\beta) \equiv \exp(X_t\beta), \tag{11.48}$$

which makes use of the linear index function $X_t\beta$. Other specifications for the
index function, possibly nonlinear, can also be used. Because the linear index
function in (11.48) is the argument of an exponential, the model specified
by (11.48) is sometimes called loglinear, since the log of $\lambda_t(\beta)$ is linear in $\beta$.
For any valid choice of $\lambda_t(\beta)$, we obtain the Poisson regression model

$$\Pr(Y_t = y) = \frac{\exp\bigl(-\lambda_t(\beta)\bigr)\bigl(\lambda_t(\beta)\bigr)^y}{y!}, \quad y = 0, 1, 2, \ldots. \tag{11.49}$$


If the observed count value for observation $t$ is $y_t$, then the contribution to
the loglikelihood function is the logarithm of the right-hand side of (11.49),
evaluated at $y = y_t$. Therefore, the entire loglikelihood function is

$$\ell(y, \beta) = \sum_{t=1}^{n}\bigl(-\exp(X_t\beta) + y_t X_t\beta - \log y_t!\bigr) \tag{11.50}$$

under the exponential mean specification (11.48).


Maximizing (11.50) is not difficult. The likelihood equations are

$$\frac{\partial\ell(y, \beta)}{\partial\beta} = \sum_{t=1}^{n}\bigl(y_t - \exp(X_t\beta)\bigr)X_t = 0, \tag{11.51}$$

and the Hessian matrix is

$$H(\beta) = -\sum_{t=1}^{n}\exp(X_t\beta)\,X_t^\top X_t = -X^\top\Upsilon(\beta)X, \tag{11.52}$$

where $\Upsilon(\beta)$ is an $n \times n$ diagonal matrix with typical diagonal element equal to
$\Upsilon_t(\beta) \equiv \exp(X_t\beta)$. Since $H(\beta)$ is negative definite, optimization techniques
based on Newton's Method generally work very well. Inferences may be based
on the standard asymptotic result (10.41) that the asymptotic covariance
matrix is equal to the inverse of the information matrix. This leads to the
estimator

$$\widehat{\mathrm{Var}}(\hat\beta) = (X^\top\hat\Upsilon X)^{-1}, \tag{11.53}$$


where $\hat\Upsilon \equiv \Upsilon(\hat\beta)$. This estimated covariance matrix looks very much like
the one for weighted least squares estimation. In fact, if we were to run the
nonlinear regression

$$y_t = \exp(X_t\beta) + u_t \tag{11.54}$$

by weighted least squares, using weights $\Upsilon_t^{-1/2}(\beta) = \exp(-\tfrac{1}{2}X_t\beta)$, the
first-order conditions, treating the weights as fixed, would be equations (11.51).
Regression (11.54) is the analog for the Poisson regression model of regression
(11.11) for the binary response model. Thus ML estimation of the Poisson
regression model specified by (11.49), where $\lambda_t(\beta)$ is given by an exponential
mean function, is seen to be equivalent to weighted NLS estimation of the
nonlinear regression model (11.54).
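
Newton's Method for the loglinear Poisson model can be sketched using only the gradient (11.51) and the Hessian (11.52). The code below is an illustration under stated assumptions: the function names, the zero starting values, the small Gaussian elimination solver, and the data are all hypothetical, and a production implementation would add safeguards such as step-size control.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= factor * M[c][q]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][q] * x[q] for q in range(r + 1, n))) / M[r][r]
    return x

def poisson_score_info(beta, y, X):
    """Gradient (11.51) and X' Upsilon X, the negative of the Hessian (11.52)."""
    k = len(beta)
    mu = [math.exp(sum(x * b for x, b in zip(x_t, beta))) for x_t in X]
    grad = [sum((y_t - m) * x_t[i] for y_t, m, x_t in zip(y, mu, X))
            for i in range(k)]
    info = [[sum(m * x_t[i] * x_t[j] for m, x_t in zip(mu, X)) for j in range(k)]
            for i in range(k)]
    return grad, info

def poisson_ml(y, X, tol=1e-10, max_iter=100):
    """Newton iterations from beta = 0; returns (beta, covariance, final gradient)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad, info = poisson_score_info(beta, y, X)
        step = solve(info, grad)                  # Newton step, -H^{-1} g
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    grad, info = poisson_score_info(beta, y, X)
    # Covariance estimate (11.53): invert X' Upsilon X one column at a time.
    cols = [solve(info, [1.0 if i == j else 0.0 for i in range(k)])
            for j in range(k)]
    cov = [[cols[j][i] for j in range(k)] for i in range(k)]
    return beta, cov, grad

y = [1, 0, 3, 2, 6, 4]                            # hypothetical counts
X = [[1.0, float(x)] for x in range(6)]           # constant and a trend
beta_hat, cov_hat, grad_hat = poisson_ml(y, X)
```

When the only regressor is a constant, the likelihood equations imply that the fitted mean equals the sample mean, which gives a convenient check on the code.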

The weighted NLS interpretation suggests that an artificial regression must 
be available. This is indeed the case. Just as the BRMR (11.20) is the GNR 
that corresponds to the weighted version of (11.11), the artificial regression 
for the Poisson regression model is the GNR that corresponds to the weighted 
version of (11.54): 


$$\exp(-\tfrac{1}{2}X_t\beta)\bigl(y_t - \exp(X_t\beta)\bigr) = \exp(\tfrac{1}{2}X_t\beta)\,X_t b + \text{residual}. \tag{11.55}$$


Like the GNR and the BRMR, this regression may be used for a number of
purposes, including estimating the covariance matrix of $\hat\beta$. It is particularly
useful for testing restrictions on $\beta$ without having to estimate the model more
than once; see Exercise 11.25.


Testing for Overdispersion in the Poisson Regression Model


Although its simplicity makes it attractive, the Poisson regression model is 
rarely entirely satisfactory. In practice, even though it may predict the mean 
event count accurately, it frequently tends to underpredict the frequency of 
zeros and large counts, because the variance of the actual data is larger than 
the variance predicted by the Poisson model. This failure of the model is called 
overdispersion. Before accepting a Poisson regression model, even tentatively, 
it is highly advisable to test it for overdispersion. 


Several tests for overdispersion have been proposed. The simplest of these 
are based on the artificial OPG regression that we introduced in Section 10.5 
for models estimated by maximum likelihood. The regressand of the OPG 
regression is equal to 1 for each observation, and the regressors are the partial 
derivatives of the loglikelihood contribution with respect to the parameters. 
Thus observation t of the OPG regression based on the loglikelihood function 
(11.50) can be written as 


\[
1 = \bigl(y_t - \exp(X_t\beta)\bigr) X_t b + \text{residual}. \tag{11.56}
\]


When the regressors in (11.56) are evaluated at the ML estimates $\hat\beta$, they are
orthogonal to the regressand.


If the variance of $y_t$ is indeed equal to $\exp(X_t\beta)$, its mean according to the
loglinear Poisson regression model, then the quantity

\[
z_t(\beta) = \bigl(y_t - \exp(X_t\beta)\bigr)^2 - y_t \tag{11.57}
\]

has expectation 0 (see footnote 4). We can test whether the expectation is really
zero by running the OPG regression (11.56), adding an extra regressor with typical
element $z_t(\hat\beta)$. Both $n$ minus the sum of squared residuals from this augmented
OPG regression and the $t$ statistic associated with the extra regressor
provide asymptotically valid test statistics; the former is asymptotically
distributed as $\chi^2(1)$ under the null hypothesis, while the latter is asymptotically
distributed as $N(0, 1)$.


Footnote 4: The quantity $\bigl(y_t - \exp(X_t\beta)\bigr)^2 - \exp(X_t\beta)$ also has expectation 0 and could
be used in place of (11.57) in an OPG test regression. However, the simplifications
that are discussed below would not be possible if the test regressor were
redefined in this way.


Testing can be made a little simpler if we note that the extra regressor (11.57)
is uncorrelated with the regressors in (11.56) under the null. This is a simple
consequence of the fact, which readers are asked to demonstrate in Exercise
11.24, that the third central moment of the Poisson distribution with parameter
$\lambda$ is equal to $\lambda$. We may write the testing OPG regression as

\[
\iota = \hat G b + c\hat z + \text{residuals}, \tag{11.58}
\]

where $\iota$ is an $n$-vector of 1s, the matrix $\hat G = G(\hat\beta)$ contains the regressors
of (11.56) evaluated at $\hat\beta$, and $\hat z = z(\hat\beta)$ is the extra regressor, with typical
element $z_t(\hat\beta)$. By the FWL Theorem, a test of the hypothesis that $c = 0$ can
equally well be performed by running the FWL regression

\[
M_{\hat G}\,\iota = c\,M_{\hat G}\,\hat z + \text{residuals}, \tag{11.59}
\]

where $M_{\hat G}$ is the orthogonal projection matrix that projects on to the orthogonal
complement of the span of the columns of $\hat G$. But, since those columns
are orthogonal to $\iota$, the regressand of (11.59) is just $\iota$. In addition, because
$z(\beta)$ is uncorrelated with the columns of $G(\beta)$, the regressor is asymptotically
equal to $\hat z$. Therefore, regressions (11.58) and (11.59) are asymptotically
equivalent to a regression of $\iota$ on $\hat z$. Once again, either the explained sum of
squares or the $t$ statistic for $c = 0$ yields an asymptotically valid test.


In Exercise 4.8, we saw that every $t$ statistic is the cotangent of a certain
angle, namely, the angle between the regressand and the regressor of the
FWL regression that can be used to compute the statistic. Since this angle
does not depend on which vector is the regressor and which vector is the
regressand, this result implies that the $t$ statistic from regressing $\iota$ on $\hat z$ is
identical to the $t$ statistic from regressing $\hat z$ on $\iota$. If we run the regression in
this direction, however, we will not obtain the same ESS. Nevertheless, the
ESS can be used as a valid statistic if the variables are scaled by estimates
of the standard deviations of the elements of $z(\beta)$. This rescaling yields the
artificial regression that is most commonly used to test for overdispersion in
the Poisson regression model.


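Observe that, if $Y$ is a random variable which follows the Poisson distribution
with parameter $\lambda$,

\[
\begin{aligned}
E\bigl(((Y-\lambda)^2 - Y)^2\bigr) &= E\bigl(((Y-\lambda)^2 - (Y-\lambda) - \lambda)^2\bigr) \\
&= E\bigl((Y-\lambda)^4\bigr) + E\bigl((Y-\lambda)^2\bigr) + \lambda^2 \\
&\quad - 2E\bigl((Y-\lambda)^3\bigr) - 2\lambda E\bigl((Y-\lambda)^2\bigr) + 2\lambda E(Y-\lambda) \\
&= 3\lambda^2 + \lambda + \lambda + \lambda^2 - 2\lambda - 2\lambda^2 = 2\lambda^2,
\end{aligned}
\]

where we have used the result of Exercise 11.24 for both the third and fourth
central moments of the Poisson distribution, and the fact that $E(Y - \lambda) = 0$.
A suitable testing regression with scaled variables can therefore be written as

\[
\frac{1}{\sqrt2}\exp(-X_t\hat\beta)\, z_t(\hat\beta) = \frac{1}{\sqrt2}\exp(-X_t\hat\beta)\, c + \text{residual}, \tag{11.60}
\]

and either the $t$ statistic or the ESS provides an asymptotically valid test statistic.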
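Regression (11.60) has a single regressor and no constant, so both statistics can be computed directly. The sketch below is illustrative only. To stay self-contained it assumes an intercept-only Poisson model, for which the ML fitted value is simply the sample mean; the function name `overdispersion_test` and the simulated samples are hypothetical.

```python
import numpy as np

def overdispersion_test(y, mu_hat):
    """t statistic and ESS from the scaled testing regression (11.60)."""
    z = (y - mu_hat) ** 2 - y                   # test regressor (11.57) at the ML fit
    w = 1.0 / (np.sqrt(2.0) * mu_hat)           # (1/sqrt 2) * exp(-X_t beta_hat)
    regressand, regressor = w * z, w
    sxx = regressor @ regressor
    c_hat = (regressor @ regressand) / sxx      # OLS through the origin
    resid = regressand - c_hat * regressor
    ess = c_hat ** 2 * sxx                      # explained sum of squares
    s2 = (resid @ resid) / (len(y) - 1)
    t_stat = c_hat / np.sqrt(s2 / sxx)
    return t_stat, ess

rng = np.random.default_rng(123)
n = 1000
# Equidispersed sample: Poisson with mean 3.
y_pois = rng.poisson(3.0, size=n)
# Overdispersed sample: negative binomial with mean 3 and variance 7.5.
y_nb = rng.negative_binomial(2, 0.4, size=n)

# Intercept-only model (an assumption of this sketch): fitted mean = sample mean.
t_pois, ess_pois = overdispersion_test(y_pois, np.full(n, y_pois.mean()))
t_nb, ess_nb = overdispersion_test(y_nb, np.full(n, y_nb.mean()))
print("Poisson data:           t = %.2f, ESS = %.2f" % (t_pois, ess_pois))
print("negative binomial data: t = %.2f, ESS = %.2f" % (t_nb, ess_nb))
```

With a full set of regressors, `mu_hat` would instead hold the fitted values from the Poisson ML estimates; the rest of the computation is unchanged.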


The tests based on regression (11.60) were originally proposed by Cameron
and Trivedi (1990). They also suggest tests based on regressions like (11.60),
but with the regressor of (11.60) multiplied by various functions of the fitted
values $\exp(X_t\hat\beta)$. Common choices are the fitted values themselves or their
squares. Cameron and Trivedi show that a test in which the regressor is
multiplied by the function $g(\exp(X_t\hat\beta))$ of the fitted value has greatest power
against DGPs for which the true variance of $y_t$ is of the form
$\exp(X_t\beta) + \alpha g(\exp(X_t\beta))$ for some scalar $\alpha$. Tests with more than one degree of freedom
can be performed by using several regressors constructed in this way. In all
cases, an appropriate test statistic is the ESS. It is asymptotically distributed
under the null as $\chi^2(r)$, where $r$ is the number of regressors.


Other tests for overdispersion have been proposed by Cameron and Trivedi 
(1986), Lee (1986), and Mullahy (1997). Note that the finite-sample distribu- 
tions of all these test statistics may differ substantially from their asymptotic 
ones. Better results may well be obtained by using bootstrap P values. A 
parametric bootstrap DGP is appropriate. It can easily be implemented by 
using a procedure for obtaining drawings from the Poisson distribution similar 
to the one we discussed for discrete choice models in the previous section. 


Consequences of Overdispersion in the Poisson Regression Model 


Finding evidence of overdispersion does not necessarily mean that we must
abandon the Poisson regression model. Since the model is equivalent to
weighted NLS, and weighted NLS is consistent even when the weights are
incorrect, the ML estimator $\hat\beta$ will be consistent whenever the exponential
mean function $\lambda_t(\beta)$ is correctly specified. In this situation, $\hat\beta$ is actually a
quasi-ML estimator, or QMLE; see Section 10.4. However, as is generally the
case for quasi-ML estimators, the covariance matrix estimator (11.53) will not
be valid if the entire model is not specified correctly.


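To find the asymptotic covariance matrix of $\hat\beta$ when the model is not correctly
specified, we may use the result (10.40), which is true for every quasi-ML
estimator. If we replace the generic parameter vector $\theta$ of that equation by $\beta$,
we obtain

\[
\mathrm{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat\beta - \beta_0)\Bigr) = \mathcal{H}^{-1}(\beta_0)\,\mathcal{I}(\beta_0)\,\mathcal{H}^{-1}(\beta_0). \tag{11.61}
\]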


For the Poisson regression model, we see from (11.52) that

\[
\mathcal{H}(\beta_0) = -\operatorname*{plim}_{n\to\infty} \frac1n \sum_{t=1}^{n} \exp(X_t\beta_0)\, X_t^\top X_t = -\operatorname*{plim}_{n\to\infty} \frac1n X^\top \Upsilon(\beta_0) X. \tag{11.62}
\]

From the definitions (10.31) and (10.32), and from the expression given
in (11.51) for the gradient of the loglikelihood, it follows that the asymptotic
information matrix is

\[
\mathcal{I}(\beta_0) = \operatorname*{plim}_{n\to\infty} \frac1n \sum_{t=1}^{n} \omega_t^2(\beta_0)\, X_t^\top X_t = \operatorname*{plim}_{n\to\infty} \frac1n X^\top \Omega(\beta_0) X, \tag{11.63}
\]

where $\omega_t^2(\beta_0) = E\bigl(y_t - \exp(X_t\beta_0)\bigr)^2$ is the conditional variance of $y_t$, and
$\Omega(\beta_0)$ is the diagonal matrix with typical diagonal element $\omega_t^2(\beta_0)$.

When the model is correctly specified, the conditional variance $\omega_t^2$ is equal
to the conditional mean $\exp(X_t\beta_0)$, and the asymptotic covariance matrix
(11.61) simplifies to $\mathcal{I}^{-1}(\beta_0) = -\mathcal{H}^{-1}(\beta_0)$. When the model is not correctly
specified, however, this simplification does not occur.


One quite plausible specification for the conditional variance of $y_t$ is

\[
\omega_t^2(\beta) = \gamma^2 \exp(X_t\beta), \tag{11.64}
\]

in which the conditional variance is proportional to the conditional mean.
Under this specification, the asymptotic covariance matrix (11.61) simplifies
to $\gamma^2$ times $-\mathcal{H}^{-1}(\beta_0)$. Since this is not a sandwich covariance matrix, it
is clear that $\hat\beta$ remains asymptotically efficient in this special case. An easy
way to estimate this covariance matrix is simply to run the artificial regression
(11.55), with $\beta = \hat\beta$. Because $s^2$ provides a consistent estimator of $\gamma^2$,
the OLS covariance matrix from this regression is asymptotically valid; see
Exercise 11.26.


Even if we do not specify the conditional variance of $y_t$, we can obtain an
asymptotically valid covariance matrix whenever the matrices (11.62) and
(11.63) can be estimated consistently. To do this, we need to use a sandwich
estimator similar to the HCCME discussed in Section 5.5. We can estimate
(11.62) consistently if we replace $\beta_0$ by $\hat\beta$. In order to estimate (11.63)
consistently, we replace the conditional variance $\omega_t^2(\beta_0)$ by the squared residual
$(y_t - \exp(X_t\hat\beta))^2$. Thus a valid estimator of $\mathrm{Var}(\hat\beta)$ when only the conditional
mean part of the Poisson regression model is correctly specified is

\[
\widehat{\mathrm{Var}}_h(\hat\beta) = \bigl(X^\top\hat\Upsilon X\bigr)^{-1} X^\top\hat\Omega X\, \bigl(X^\top\hat\Upsilon X\bigr)^{-1}, \tag{11.65}
\]

where $\hat\Omega$ is the $n \times n$ diagonal matrix with diagonal element $t$ given by
$(y_t - \exp(X_t\hat\beta))^2$. As in Section 5.5, the “h” subscript indicates that the
matrix (11.65) is valid in the presence of heteroskedasticity of unknown form.
Given the substantial risk of misspecification, it is strongly recommended to
use the sandwich estimator (11.65) rather than (11.53) in practical applications.
Notice that the sandwich estimator is very easy to calculate without
any special software. If we run the artificial regression (11.55) and ask the
regression package to compute an HCCME, it will give us either (11.65) or
something that is asymptotically equal to (11.65); see Exercise 11.27.
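The three covariance matrix estimators discussed so far can be compared on simulated data. The sketch below is illustrative only: the helper names, the fixed number of Newton steps, and the data-generating process (a gamma mixture of Poissons chosen so that the variance is three times the mean, as in (11.64) with $\gamma^2 = 3$) are assumptions made for this example.

```python
import numpy as np

def fit_poisson(y, X, iters=25):
    """Poisson (quasi-)ML estimates from a fixed number of Newton steps."""
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(iters):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve((X.T * mu) @ X, X.T @ (y - mu))
    return beta

def poisson_covariances(y, X, beta_hat):
    """Return estimator (11.53), its s^2-scaled version, and the sandwich (11.65)."""
    n, k = X.shape
    mu = np.exp(X @ beta_hat)
    a_inv = np.linalg.inv((X.T * mu) @ X)          # (X' Upsilon_hat X)^{-1}
    middle = (X.T * (y - mu) ** 2) @ X             # X' Omega_hat X
    # s^2 from artificial regression (11.55): at beta_hat the regressand is
    # orthogonal to the regressors, so the residuals equal the regressand.
    s2 = np.sum((y - mu) ** 2 / mu) / (n - k)
    return a_inv, s2 * a_inv, a_inv @ middle @ a_inv, s2

rng = np.random.default_rng(7)
n, gamma2 = 5000, 3.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.5, 0.3]))
# Gamma mixing with mean mu and variance (gamma2 - 1) * mu gives Var(y) = gamma2 * mu.
lam = rng.gamma(shape=mu_true / (gamma2 - 1.0), scale=gamma2 - 1.0)
y = rng.poisson(lam)

beta_hat = fit_poisson(y, X)
cov_ml, cov_scaled, cov_sandwich, s2 = poisson_covariances(y, X, beta_hat)
print("s^2 (estimate of gamma^2):", s2)
print("ML std. errors:      ", np.sqrt(np.diag(cov_ml)))
print("scaled std. errors:  ", np.sqrt(np.diag(cov_scaled)))
print("sandwich std. errors:", np.sqrt(np.diag(cov_sandwich)))
```

In this design the scaled and sandwich standard errors should be close to each other and noticeably larger than those based on (11.53), which ignore the overdispersion.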


Of course, except in the special case of (11.64), the ML estimator $\hat\beta$ will not
be asymptotically efficient when the Poisson regression model is not correctly
specified. The fact that the covariance matrix has the sandwich form makes
this clear. Moreover, $\hat\beta$ will not even be consistent if the conditional mean
function $\exp(X_t\beta)$ is not correctly specified. Many other models for count
data have been suggested, and one or more of them may well fit better than the
Poisson regression model does. Wooldridge (1999) and Cameron and Trivedi
(2001) provide more advanced introductions to the topic of count data, and
Cameron and Trivedi (1998) provides a detailed treatment of a large number
of different models for data of this type.


11.6 Models for Censored and Truncated Data 


Continuous dependent variables can sometimes take only a limited range of 
values. This may happen because they have been censored or truncated in 
some way. These two terms are easily confused. A sample is said to be 
truncated if some observations have been systematically excluded from the 
sample. For example, a sample of households with incomes under $200,000 
explicitly excludes households with incomes over that level. It is not a random 
sample of all households. If the dependent variable is income, or something 
correlated with income, results using the truncated sample could potentially 
be quite misleading. 


On the other hand, a sample has been censored if no observations have been 
systematically excluded, but some of the information contained in them has 
been suppressed. Think of a “censor” who reads people’s mail and blacks 
out certain parts of it. The recipients still get their mail, but parts of it are 
unreadable. To continue the previous example, suppose that households with 
all income levels are included in the sample, but for those with incomes in 
excess of $200,000, the amount reported is always exactly $200,000. This sort 
of censoring is often done in practice, presumably to protect the privacy of 
high-income respondents. In this case, the censored sample is still a random 
sample of all households, but the values reported for high-income households 
are not the true values. 


Any dependent variable that has been either censored or truncated is said to
be a limited dependent variable. Special methods are needed to deal with
such variables because, if we simply use least squares, the consequences of
truncation and censoring can be severe. Consider the regression model

\[
y_t^\circ = \beta_1 + \beta_2 x_t + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2), \tag{11.66}
\]

where $y_t^\circ$ is a latent variable. We actually observe $y_t$, which differs from
$y_t^\circ$ because it is either truncated or censored. For simplicity, suppose that
censorship or truncation occurs whenever $y_t^\circ$ is less than 0. Clearly, the larger
is the error term $u_t$, the larger will be $y_t^\circ$, and thus the greater will be the
probability that $y_t^\circ \ge 0$. This probability must also depend on $x_t$. Thus, for
the sample we actually observe, $u_t$ will no longer have conditional mean 0, and
it will not be uncorrelated with $x_t$. Since the error terms no longer satisfy these
key assumptions, it is not surprising that OLS estimation using truncated or
censored samples yields estimators that are biased and inconsistent.


The consequences of censoring and truncation are illustrated in Figure 11.3.
The figure shows 200 $(x_t, y_t^\circ)$ pairs generated from the model (11.66). The
71 observations with $y_t^\circ < 0$ are shown as circles, and the 129 observations
with $y_t^\circ > 0$ are shown as black dots. The solid line is the true regression
function, and the nearby dotted line is the regression function obtained by
OLS estimation using all the observations. When the data are truncated, the
observations with $y_t^\circ < 0$ are discarded. OLS estimation using this truncated
sample yields the regression line shown in small dots. When the data are
censored, these 71 observations are retained, but $y_t$ is set equal to 0 for all of
them. OLS estimation using this censored sample yields the dashed regression
line. Neither of these regression lines is at all close to the true one.

Figure 11.3 Effects of censoring and truncation

In this example, the consequences of either censoring or truncation are quite
severe. Just how severe they will be in any particular case depends on $\sigma^2$,
the variance of the error terms in (11.66), and on the extent of the censoring
or truncation. If $\sigma^2$ is very small relative to the variation in the fitted values,
so will be the bias induced by limiting the dependent variable. This bias will
also be small if few observations are censored or truncated. Conversely, when
$\sigma^2$ is large and many observations are censored or truncated, the bias can be
extremely large.
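A small simulation makes the direction of the bias concrete. This is only a sketch: the parameter values, the distribution of $x_t$, and the sample size are assumptions chosen for illustration and are not those used to draw Figure 11.3.

```python
import numpy as np

def ols_slope(x, y):
    """Slope coefficient from an OLS regression of y on a constant and x."""
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

rng = np.random.default_rng(42)
n = 5000
beta1, beta2, sigma = 1.0, 1.0, 2.0        # hypothetical parameter values
x = rng.normal(size=n)
y_latent = beta1 + beta2 * x + sigma * rng.normal(size=n)

keep = y_latent >= 0                        # observations that survive truncation
slope_full = ols_slope(x, y_latent)                      # all latent values observed
slope_trunc = ols_slope(x[keep], y_latent[keep])         # truncated sample
slope_cens = ols_slope(x, np.maximum(y_latent, 0.0))     # censored at zero

print("share of observations below zero:", 1.0 - keep.mean())
print("slope, full sample:     ", slope_full)
print("slope, truncated sample:", slope_trunc)
print("slope, censored sample: ", slope_cens)
```

In this design roughly a third of the latent values fall below zero, and both the truncated and the censored slopes are pulled toward zero relative to the true value of 1.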


Truncated Regression Models 


It is quite simple to estimate a truncated regression model by maximum like- 
lihood if the distribution of the error terms in the latent variable model is 
assumed to be known. By far the most common assumption is that the error 
terms are normally, independently, and identically distributed, as in (11.66). 
We restrict our attention to this special case. 
If the regression function for the latent variable model is $X_t\beta$, the probability
that $y_t^\circ$ is included in the sample is

\[
\begin{aligned}
\Pr(y_t^\circ \ge 0) &= \Pr(X_t\beta + u_t \ge 0) \\
&= 1 - \Pr(u_t < -X_t\beta) = 1 - \Pr(u_t/\sigma < -X_t\beta/\sigma) \\
&= 1 - \Phi(-X_t\beta/\sigma) = \Phi(X_t\beta/\sigma).
\end{aligned}
\]

When $y_t^\circ \ge 0$ and $y_t$ is observed, the density of $y_t$ is proportional to the
density of $y_t^\circ$. Otherwise, the density of $y_t$ is 0. The factor of proportionality,
which is needed to ensure that the density of $y_t$ integrates to unity, is the
inverse of the probability that $y_t^\circ \ge 0$. Therefore, the density of $y_t$ can be
written as

\[
\frac{\sigma^{-1}\phi\bigl((y_t - X_t\beta)/\sigma\bigr)}{\Phi(X_t\beta/\sigma)}.
\]

This implies that the loglikelihood function, which is the sum over all $t$ of the
log of the density of $y_t$, is

\[
\ell(y, \beta, \sigma) = -\frac{n}{2}\log(2\pi) - n\log(\sigma) - \frac{1}{2\sigma^2}\sum_{t=1}^{n}(y_t - X_t\beta)^2 - \sum_{t=1}^{n}\log\Phi(X_t\beta/\sigma). \tag{11.67}
\]


Maximization of expression (11.67) is generally not difficult. Even though 
the loglikelihood function is not globally concave, there is a unique MLE; see 
Orme and Ruud (2000). 


The first three terms in expression (11.67) comprise the loglikelihood function
that corresponds to OLS regression; see equation (10.10). The last term is
minus the summation over all $t$ of the logarithms of the probabilities that an
observation with regression function $X_t\beta$ belongs to the sample. Since these
probabilities must be less than 1, this term must always be positive. It can
be made larger by making the probabilities smaller. Thus the maximization
algorithm will choose the parameters in such a way that these probabilities
are smaller than they would be for the OLS estimates. The presence of this
fourth term therefore causes the ML estimates of $\beta$ and $\sigma$ to differ, often
substantially, from their least squares counterparts, and it ensures that the
ML estimates are consistent.
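The loglikelihood (11.67) is easy to code directly from the truncated density. The following sketch is illustrative only: the helper names, the simulated sample, and the parameter values are assumptions, and the normal CDF is built from `math.erf` merely to avoid extra dependencies.

```python
import numpy as np
from math import erf, sqrt, log, pi

_erf = np.vectorize(erf, otypes=[float])

def norm_cdf(a):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(a, dtype=float) / sqrt(2.0)))

def norm_pdf(a):
    """Standard normal PDF, applied elementwise."""
    return np.exp(-0.5 * np.asarray(a, dtype=float) ** 2) / sqrt(2.0 * pi)

def truncated_density(y, mean, sigma):
    """Density of y_t given inclusion: normal density divided by Phi(mean/sigma)."""
    return norm_pdf((y - mean) / sigma) / (sigma * norm_cdf(mean / sigma))

def truncreg_loglik(beta, sigma, y, X):
    """Loglikelihood (11.67) for a sample truncated from below at zero."""
    xb = X @ beta
    n = len(y)
    return (-0.5 * n * log(2.0 * pi) - n * log(sigma)
            - np.sum((y - xb) ** 2) / (2.0 * sigma ** 2)
            - np.sum(np.log(norm_cdf(xb / sigma))))

# Simulated truncated sample with assumed parameter values.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y_latent = 1.0 + x + 2.0 * rng.normal(size=500)
keep = y_latent >= 0
X = np.column_stack([np.ones(keep.sum()), x[keep]])
y = y_latent[keep]

beta, sigma = np.array([1.0, 1.0]), 2.0
ll = truncreg_loglik(beta, sigma, y, X)
ll_check = np.sum(np.log(truncated_density(y, X @ beta, sigma)))

# Midpoint-rule check that the truncated density integrates to one on [0, 20].
dx = 0.001
grid = (np.arange(20000) + 0.5) * dx
area = np.sum(truncated_density(grid, 1.0, 2.0)) * dx
print(ll, ll_check, area)
```

Passing the negative of `truncreg_loglik` to any general-purpose minimizer would then give the ML estimates; that step is omitted here.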


It is not difficult to modify this model to allow for other forms of truncation. 
The sample can be truncated from above, from below, or from both above 
and below. The truncation points must be known, but they can be fixed or 
they can vary across observations. See Exercises 11.29 and 11.30. 

Censored Regression Models


The most popular model for censored data is the tobit model, which was
first suggested in the famous paper of Tobin (1958). The simplest
version of the tobit model is

\[
\begin{aligned}
y_t^\circ &= X_t\beta + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^2), \\
y_t &= y_t^\circ \ \text{if } y_t^\circ > 0; \quad y_t = 0 \ \text{otherwise}.
\end{aligned}
\]

Here $y_t^\circ$ is a latent variable that is observed whenever it is positive. However,
when the latent variable is negative, the observation is censored, and we simply
observe $y_t = 0$. The tobit model can readily be modified to allow for censoring
from above instead of from below or for censoring from both above and below.
It can also be modified to allow the point at which the censoring occurs to
vary across observations in a deterministic way; see Exercise 11.31.


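The loglikelihood function for the tobit model is a little unusual, but it is not
difficult to derive. First, it is easy to see that

\[
\begin{aligned}
\Pr(y_t = 0) &= \Pr(y_t^\circ \le 0) = \Pr(X_t\beta + u_t \le 0) \\
&= \Pr\Bigl(\frac{u_t}{\sigma} \le -\frac{X_t\beta}{\sigma}\Bigr) = \Phi(-X_t\beta/\sigma).
\end{aligned}
\]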


Therefore, since there is a positive probability that $y_t = 0$, the contribution
to the loglikelihood function made by observations with $y_t = 0$ is not the log
of the density, but the log of that positive probability, namely,

\[
\ell_t(y_t, \beta, \sigma) = \log\Phi(-X_t\beta/\sigma). \tag{11.68}
\]

If $y_t$ is positive, the density of $y_t$ exists, and the contribution to the
loglikelihood is its logarithm,

\[
\log\Bigl(\frac{1}{\sigma}\phi\bigl((y_t - X_t\beta)/\sigma\bigr)\Bigr), \tag{11.69}
\]

which is the contribution to the loglikelihood function for an observation in a
classical normal linear regression model without any censoring.

Combining expression (11.68), the contribution for the censored observations,
with expression (11.69), the contribution for the uncensored ones, we find that
the loglikelihood function for the tobit model is

\[
\sum_{y_t = 0} \log\Phi(-X_t\beta/\sigma) + \sum_{y_t > 0} \log\Bigl(\frac{1}{\sigma}\phi\bigl((y_t - X_t\beta)/\sigma\bigr)\Bigr). \tag{11.70}
\]


This loglikelihood function is rather curious. The first term is the sum of the
logs of probabilities, for the censored observations, while the second is the
sum of the logs of densities, for the uncensored observations. This reflects the
fact that the dependent variable in a tobit model has a distribution that is
a mixture of discrete and continuous random variables. This fact does not,
however, prevent the ML estimator for the tobit model from having the usual
properties of consistency and asymptotic normality, as was shown explicitly
by Amemiya (1973c).


It is generally somewhat easier to maximize the loglikelihood function (11.70)
if the tobit model is reparametrized. The new parameters are $\gamma = \beta/\sigma$
and $h = 1/\sigma$. Since the loglikelihood function can be shown to be globally
concave in the latter parametrization (Olsen, 1978), there must be a unique
maximum no matter which parametrization is used. Even without any
reparametrization, it is generally not at all difficult to maximize (11.70) by using a
quasi-Newton algorithm.
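The two parametrizations of the tobit loglikelihood can be written side by side. In the sketch below, which is illustrative only, the function names and the simulated data are assumptions; the second function uses $(y_t - X_t\beta)/\sigma = h y_t - X_t\gamma$ and $\log(1/\sigma) = \log h$.

```python
import numpy as np
from math import erf, sqrt, log, pi

_erf = np.vectorize(erf, otypes=[float])

def norm_cdf(a):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(a, dtype=float) / sqrt(2.0)))

def tobit_loglik(beta, sigma, y, X):
    """Tobit loglikelihood (11.70) in the (beta, sigma) parametrization."""
    xb = X @ beta
    cens = y == 0
    ll_cens = np.sum(np.log(norm_cdf(-xb[cens] / sigma)))
    e = (y[~cens] - xb[~cens]) / sigma
    ll_unc = np.sum(-0.5 * log(2.0 * pi) - log(sigma) - 0.5 * e ** 2)
    return ll_cens + ll_unc

def tobit_loglik_olsen(gamma, h, y, X):
    """The same loglikelihood with gamma = beta / sigma and h = 1 / sigma."""
    xg = X @ gamma
    cens = y == 0
    ll_cens = np.sum(np.log(norm_cdf(-xg[cens])))
    e = h * y[~cens] - xg[~cens]
    ll_unc = np.sum(-0.5 * log(2.0 * pi) + log(h) - 0.5 * e ** 2)
    return ll_cens + ll_unc

# Simulated censored sample with assumed parameter values.
rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.0]) + 1.5 * rng.normal(size=n), 0.0)

beta, sigma = np.array([0.3, 0.8]), 1.2        # arbitrary evaluation point
ll_a = tobit_loglik(beta, sigma, y, X)
ll_b = tobit_loglik_olsen(beta / sigma, 1.0 / sigma, y, X)
print(ll_a, ll_b)
```

The two values agree at any evaluation point, since the reparametrization only relabels the parameters; what changes is the shape of the function seen by the optimizer.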


The $(k+1) \times (k+1)$ covariance matrix of the ML estimates may, as usual,
be estimated in several ways. Analytic expressions for the information matrix 
exist (Amemiya, 1973c), and at least two artificial regressions are available. 
One of these is the OPG regression that we discussed in Section 10.5, and the 
other is a double-length regression proposed by Orme (1995). The latter is 
substantially more complicated than the former, but it seems to work very 
much better. Since the tobit model is fully specified, it is straightforward 
to employ the parametric bootstrap. Simulation results in Davidson and 
MacKinnon (1999a) suggest that inferences based on it can be much more 
reliable than ones based only on asymptotic theory. 


Testing the Tobit Model 


There is an interesting relationship among the tobit, truncated regression, and
probit models. If we both add the term $\sum_{y_t > 0} \log\Phi(X_t\beta/\sigma)$ to and subtract it
from the tobit loglikelihood function (11.70), the latter becomes

\[
\begin{aligned}
&\sum_{y_t > 0} \log\Bigl(\frac{1}{\sigma}\phi\bigl((y_t - X_t\beta)/\sigma\bigr)\Bigr) - \sum_{y_t > 0} \log\Phi(X_t\beta/\sigma) \\
&\quad + \sum_{y_t = 0} \log\Phi(-X_t\beta/\sigma) + \sum_{y_t > 0} \log\Phi(X_t\beta/\sigma).
\end{aligned} \tag{11.71}
\]

The first line of (11.71) is the loglikelihood function for a truncated regression
model estimated over all the observations for which $y_t > 0$; compare (11.67).
The second line is the loglikelihood function for a probit model with index
function $X_t\beta/\sigma$; compare (11.09). Of course, if all we had was the second
line here, we could not identify $\beta$ and $\sigma$ separately, but since we also have the
first line, that is not a problem.
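The decomposition in (11.71) can be confirmed numerically. The sketch below is only an illustration; the function names, the simulated data, and the evaluation point are assumptions. It evaluates the tobit loglikelihood and the two pieces separately and compares them.

```python
import numpy as np
from math import erf, sqrt, log, pi

_erf = np.vectorize(erf, otypes=[float])

def norm_cdf(a):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(a, dtype=float) / sqrt(2.0)))

def log_normal_density(y, mean, sigma):
    """Log of the N(mean, sigma^2) density, as in (11.69)."""
    return -0.5 * log(2.0 * pi) - log(sigma) - 0.5 * ((y - mean) / sigma) ** 2

def tobit_loglik(beta, sigma, y, X):
    """Tobit loglikelihood (11.70)."""
    xb, pos = X @ beta, y > 0
    return (np.sum(np.log(norm_cdf(-xb[~pos] / sigma)))
            + np.sum(log_normal_density(y[pos], xb[pos], sigma)))

def truncated_part(beta, sigma, y, X):
    """First line of (11.71): truncated regression over observations with y_t > 0."""
    xb, pos = X @ beta, y > 0
    return np.sum(log_normal_density(y[pos], xb[pos], sigma)
                  - np.log(norm_cdf(xb[pos] / sigma)))

def probit_part(beta, sigma, y, X):
    """Second line of (11.71): probit for the event y_t > 0 with index X_t beta / sigma."""
    idx, pos = X @ beta / sigma, y > 0
    return (np.sum(np.log(norm_cdf(-idx[~pos])))
            + np.sum(np.log(norm_cdf(idx[pos]))))

# Simulated censored sample with assumed parameter values.
rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.0]) + 1.5 * rng.normal(size=n), 0.0)

beta, sigma = np.array([0.4, 0.9]), 1.3        # arbitrary evaluation point
total = tobit_loglik(beta, sigma, y, X)
pieces = truncated_part(beta, sigma, y, X) + probit_part(beta, sigma, y, X)
print(total, pieces)
```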


Writing the tobit loglikelihood function in the form of (11.71) makes it clear
that this model is really a probit model combined with a truncated regression
model, with the coefficient vectors in the two models restricted to be proportional
to each other. This restriction can easily be tested by means of an LR
test with $k$ degrees of freedom. If this test leads to a rejection of the null
hypothesis, then we probably should not be using a tobit model.


Of course, like all econometric models, the tobit model can and should be 
tested for a variety of types of possible misspecification. A large number of 
tests can be based on the OPG regression and on the double-length regression 
of Orme (1995). Tests based on the OPG regression are discussed by Pagan 
and Vella (1989) and Smith (1989). See also Chesher and Irish (1987). 


11.7 Sample Selectivity 


In the previous section, we considered samples truncated on the basis of the 
value of the dependent variable. Many samples are truncated on the basis of 
another variable that is correlated with the dependent variable. For example, 
people may choose to enter the labor force if their market wage exceeds their 
reservation wage and choose to stay out of it otherwise. Then a sample of 
people who are in the labor force will exclude those whose reservation wage 
exceeds their market wage. If the dependent variable, whatever it may be, 
is correlated with the difference between reservation and market wages, least 
squares will yield inconsistent estimates. In this case, the sample is said to 
have been selected on the basis of this difference. The consequences of this 
type of sample selection are often said to be due to sample selectivity. 


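Let us consider a simple model that involves sample selectivity. Suppose
that $y_t^\circ$ and $z_t^\circ$ are two latent variables, generated by the bivariate process

\[
\begin{bmatrix} y_t^\circ \\ z_t^\circ \end{bmatrix}
= \begin{bmatrix} X_t\beta \\ W_t\gamma \end{bmatrix}
+ \begin{bmatrix} u_t \\ v_t \end{bmatrix}, \qquad
\begin{bmatrix} u_t \\ v_t \end{bmatrix} \sim \mathrm{NID}\!\left(0, \begin{bmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{bmatrix}\right), \tag{11.72}
\]

where $X_t$ and $W_t$ are vectors of observations on exogenous or predetermined
variables, $\beta$ and $\gamma$ are unknown parameter vectors, $\sigma$ is the standard deviation
of $u_t$, and $\rho$ is the correlation between $u_t$ and $v_t$. The restriction that the
variance of $v_t$ is equal to 1 is imposed because only the sign of $z_t^\circ$ will be
observed. In fact, the variables that are actually observed are $y_t$ and $z_t$, and
they are related to $y_t^\circ$ and $z_t^\circ$ as follows:

\[
\begin{aligned}
y_t &= y_t^\circ \ \text{if } z_t^\circ > 0; \quad y_t \ \text{unobserved otherwise}; \\
z_t &= 1 \ \text{if } z_t^\circ > 0; \quad z_t = 0 \ \text{otherwise}.
\end{aligned} \tag{11.73}
\]

Thus there are two types of observations, those for which we observe $y_t = y_t^\circ$
and $z_t = 1$, along with both $X_t$ and $W_t$, and those for which we observe only
$z_t = 0$ and $W_t$.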


Each observation contributes a factor to the likelihood function for this model
that can be written as

\[
I(z_t = 0)\Pr(z_t = 0) + I(z_t = 1)\Pr(z_t = 1)\, f(y_t^\circ \mid z_t = 1),
\]

where $f(y_t^\circ \mid z_t = 1)$ denotes the density of $y_t^\circ$ conditional on $z_t = 1$. This
is the appropriate way to specify the likelihood because, if we integrate with
respect to $y_t^\circ$ and sum over the two possible values of $z_t$, the result is 1. Note
also that the value of $y_t^\circ$ is needed only if it is observed, that is, if $z_t = 1$. The
loglikelihood function is

\[
\sum_{z_t = 0} \log\Pr(z_t = 0) + \sum_{z_t = 1} \log\bigl(\Pr(z_t = 1)\, f(y_t^\circ \mid z_t = 1)\bigr). \tag{11.74}
\]

The first term of (11.74), which comes from the observations with $z_t = 0$, is
exactly the same as the corresponding term in a probit model. The second
term comes from the observations with $z_t = 1$. By using the fact that we can
factor a joint density any way we please, it can also be written as

\[
\sum_{z_t = 1} \log\bigl(\Pr(z_t = 1 \mid y_t^\circ)\, f(y_t^\circ)\bigr),
\]

where $f(y_t^\circ)$ is the density of $y_t^\circ$ conditional on predetermined or exogenous
variables, which is just a normal density with mean $X_t\beta$ and variance $\sigma^2$.


In order to write out the loglikelihood function (11.74) explicitly, we must
calculate $\Pr(z_t = 1 \mid y_t^\circ)$. Since $u_t$ and $v_t$ are bivariate normal, we can write
$v_t = \rho u_t/\sigma + \varepsilon_t$, where $\varepsilon_t$ is a normally distributed random variable with
mean 0 and variance $1 - \rho^2$. Thus

\[
z_t^\circ = W_t\gamma + \rho(y_t^\circ - X_t\beta)/\sigma + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{NID}(0, 1 - \rho^2).
\]

Because $y_t = y_t^\circ$ when $z_t = 1$, it follows that

\[
\Pr(z_t = 1 \mid y_t^\circ) = \Phi\!\left(\frac{W_t\gamma + \rho(y_t - X_t\beta)/\sigma}{(1 - \rho^2)^{1/2}}\right).
\]

Thus the loglikelihood function (11.74) becomes

\[
\begin{aligned}
&\sum_{z_t = 0} \log\Phi(-W_t\gamma) + \sum_{z_t = 1} \log\Bigl(\frac{1}{\sigma}\phi\bigl((y_t - X_t\beta)/\sigma\bigr)\Bigr) \\
&\quad + \sum_{z_t = 1} \log\Phi\!\left(\frac{W_t\gamma + \rho(y_t - X_t\beta)/\sigma}{(1 - \rho^2)^{1/2}}\right).
\end{aligned} \tag{11.75}
\]

The first term looks like the corresponding term for a standard probit model
in which $z_t$ is explained by $W_t$, the second term looks like the loglikelihood
function for a linear regression of $y_t$ on $X_t$, with normal errors, and the third
term is one that we have not seen before. If $\rho = 0$, this term would collapse to
the term corresponding to observations with $z_t = 1$ in the probit model for $z_t$,
and we could estimate the probit model and the regression model separately.
In general, however, this term forces us to estimate both equations together
by making the probability that $z_t = 1$ depend on $y_t - X_t\beta$.
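The three terms of (11.75) translate directly into code. This sketch is illustrative only: the function name `selection_loglik`, the simulated data, and the parameter values are assumptions. The final lines check the remark about $\rho = 0$, in which case the loglikelihood splits into a probit part and a regression part.

```python
import numpy as np
from math import erf, sqrt, log, pi

_erf = np.vectorize(erf, otypes=[float])

def norm_cdf(a):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(a, dtype=float) / sqrt(2.0)))

def selection_loglik(beta, gamma, sigma, rho, y, z, X, W):
    """Loglikelihood (11.75); y is used only where z == 1."""
    wg = W @ gamma
    sel = z == 1
    e = (y[sel] - X[sel] @ beta) / sigma                     # standardized residuals
    term1 = np.sum(np.log(norm_cdf(-wg[~sel])))              # observations with z_t = 0
    term2 = np.sum(-0.5 * log(2.0 * pi) - log(sigma) - 0.5 * e ** 2)
    term3 = np.sum(np.log(norm_cdf((wg[sel] + rho * e) / sqrt(1.0 - rho ** 2))))
    return term1 + term2 + term3

# Simulated data with assumed parameter values.
rng = np.random.default_rng(11)
n = 1000
x1, w2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
W = np.column_stack([np.ones(n), x1, w2])
beta, gamma, sigma, rho = np.array([1.0, 0.5]), np.array([0.2, 0.5, 1.0]), 1.5, 0.7
v = rng.normal(size=n)
u = sigma * (rho * v + sqrt(1.0 - rho ** 2) * rng.normal(size=n))
z = (W @ gamma + v > 0).astype(float)
y = np.where(z == 1, X @ beta + u, np.nan)                   # y unobserved when z = 0

ll_true = selection_loglik(beta, gamma, sigma, rho, y, z, X, W)

# With rho = 0 the function reduces to a probit part plus a regression part.
sel = z == 1
wg = W @ gamma
probit_ll = np.sum(np.log(norm_cdf(-wg[~sel]))) + np.sum(np.log(norm_cdf(wg[sel])))
e = (y[sel] - X[sel] @ beta) / sigma
regression_ll = np.sum(-0.5 * log(2.0 * pi) - log(sigma) - 0.5 * e ** 2)
ll_rho0 = selection_loglik(beta, gamma, sigma, 0.0, y, z, X, W)
print(ll_true, ll_rho0, probit_ll + regression_ll)
```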


Copyright © 1999, Russell Davidson and James G. MacKinnon 




Heckman’s Two-Step Method 


From the point of view of asymptotic efficiency, the best way to estimate the 
model characterized by (11.72) and (11.73) is simply to maximize the loglike- 
lihood function (11.75). With modern computing equipment and appropriate 
software, this is not unreasonably difficult to do, although numerical problems
can be encountered when $\rho$ approaches $\pm 1$. Instead of ML estimation,
however, it is popular to use a computationally simpler technique, which is 
known as Heckman’s two-step method; see Heckman (1976, 1979). Although 
we do not recommend that practitioners rely solely on this method, it can be 
useful for preliminary work, and it yields insights into the nature of sample 
selectivity. In addition, it provides a good starting point for the nonlinear 
algorithm used to obtain the MLE. 


Heckman’s two-step method is based on the fact that the first equation of
(11.72), for observations where $y_t$ is observed, can be rewritten as

\[
y_t = X_t\beta + \rho\sigma v_t + e_t. \tag{11.76}
\]

Here the error term $u_t$ is divided into two parts, one perfectly correlated
with $v_t$, the error term in the equation for the latent variable $z_t^\circ$, and one
independent of $v_t$. The idea is to replace the unobserved error term $v_t$ in
(11.76) by its mean conditional on $z_t = 1$ and on the explanatory variables $W_t$.
This conditional mean is

\[
E(v_t \mid z_t = 1, W_t) = E(v_t \mid v_t > -W_t\gamma, W_t) = \frac{\phi(W_t\gamma)}{\Phi(W_t\gamma)}, \tag{11.77}
\]

where readers are asked to prove the last equality in Exercise 11.32. The
quantity $\phi(x)/\Phi(x)$ is known as the inverse Mills ratio; see Johnson, Kotz,
and Balakrishnan (1994). In the first step of Heckman’s two-step method, an
ordinary probit model is used to obtain consistent estimates $\hat\gamma$ of the parameters
of the selection equation. In the second step, the unobserved $v_t$ in
regression (11.76) is replaced by the selectivity regressor $\phi(W_t\hat\gamma)/\Phi(W_t\hat\gamma)$,
and regression (11.76) becomes

\[
y_t = X_t\beta + \rho\sigma\,\frac{\phi(W_t\hat\gamma)}{\Phi(W_t\hat\gamma)} + \text{residual}. \tag{11.78}
\]

This Heckman regression, as it is often called, is easy to estimate by OLS and
yields consistent estimates of $\beta$.
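The two steps can be sketched as follows. Everything specific here is an assumption made for illustration: the function names, the Newton iteration used for the probit step, and the simulated design, in which $W_t$ contains one variable that is excluded from $X_t$.

```python
import numpy as np
from math import erf, sqrt, pi

_erf = np.vectorize(erf, otypes=[float])

def norm_pdf(a):
    return np.exp(-0.5 * a ** 2) / sqrt(2.0 * pi)

def norm_cdf(a):
    return 0.5 * (1.0 + _erf(a / sqrt(2.0)))

def fit_probit(z, W, iters=25):
    """Step 1: probit ML estimates of gamma by Newton's method."""
    q = 2.0 * z - 1.0                                 # +1 if z = 1, -1 if z = 0
    gamma = np.zeros(W.shape[1])
    for _ in range(iters):
        idx = W @ gamma
        lam = q * norm_pdf(idx) / norm_cdf(q * idx)   # score contributions
        grad = W.T @ lam
        hess = -(W.T * (lam * (lam + idx))) @ W
        gamma = gamma - np.linalg.solve(hess, grad)
    return gamma

def heckman_two_step(y, z, X, W):
    """Step 2: OLS of y on X and the inverse Mills ratio, selected sample only."""
    gamma_hat = fit_probit(z, W)
    sel = z == 1
    idx = W[sel] @ gamma_hat
    mills = norm_pdf(idx) / norm_cdf(idx)             # selectivity regressor
    R = np.column_stack([X[sel], mills])
    coef = np.linalg.lstsq(R, y[sel], rcond=None)[0]
    return coef[:-1], coef[-1], gamma_hat             # beta, rho*sigma, gamma

# Simulated data with assumed parameter values.
rng = np.random.default_rng(2024)
n = 20000
x1, w2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
W = np.column_stack([np.ones(n), x1, w2])
beta_true, gamma_true, sigma, rho = np.array([1.0, 0.5]), np.array([0.2, 0.5, 1.0]), 1.5, 0.7
v = rng.normal(size=n)
u = sigma * (rho * v + sqrt(1.0 - rho ** 2) * rng.normal(size=n))
z = (W @ gamma_true + v > 0).astype(float)
y = np.where(z == 1, X @ beta_true + u, np.nan)

beta_hat, rho_sigma_hat, gamma_hat = heckman_two_step(y, z, X, W)
naive = np.linalg.lstsq(X[z == 1], y[z == 1], rcond=None)[0]   # OLS ignoring selection
print("two-step beta:", beta_hat, " rho*sigma:", rho_sigma_hat)
print("naive OLS beta:", naive)
```

The sketch returns point estimates only; the OLS standard errors from the second step should not be used as they stand.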


Regression (11.78) provides a test for sample selectivity as well as an estimation
technique. The coefficient of the selectivity regressor is $\rho\sigma$. Since $\sigma \ne 0$,
the ordinary $t$ statistic for this coefficient to be zero can be used to test the
hypothesis that $\rho = 0$, and it will be asymptotically distributed as $N(0, 1)$
under the null hypothesis. If this coefficient is not significantly different from
zero, the investigator may reasonably decide that selectivity is not a problem
and proceed to use least squares as usual.

Although the Heckman regression (11.78) yields consistent estimates of $\beta$, the
OLS covariance matrix is valid only when $\rho = 0$. The problem is that the
selectivity regressor is being treated like any other regressor, when it is in
fact part of the error term. It is possible to obtain a valid covariance matrix
estimate to go along with the two-step estimates of $\beta$ from (11.78), but the
calculation is quite cumbersome, and the estimated covariance matrix is not
always positive definite. See Greene (1981) and Lee (1982) for details.


It should be stressed that the consistency of this two-step estimator, like
that of the ML estimator, depends critically on the assumption of bivariate
normality. This can be seen from the specification of the selectivity regressor
as the inverse Mills ratio (11.77). When the elements of $W_t$ are the same as
the elements of $X_t$, as is often the case in practice, it is only the nonlinearity
of the inverse Mills ratio as a function of $W_t\gamma$ that makes the parameters of
the second-step regression identifiable. The form of the nonlinear relationship
would be different if the error terms did not follow the normal distribution.


11.8 Duration Models 


Economists are sometimes interested in how much time elapses before some 
event occurs. For example, they may be interested in the length of labor dis- 
putes (that is, strike duration), the age of first marriage for men and women 
(that is, the duration of the state of being single), the duration of unemploy- 
ment spells, the duration between trades on a stock exchange, or the length 
of time people wait before trading in a car. In this section, we will discuss 
some simple econometric models for duration data of this type. 


In many cases, each observation in the sample consists of a measured duration,
denoted $t_i$, and a $1 \times k$ vector of exogenous variables, denoted $X_i$. In adopting
this formulation, we have implicitly ruled out the possibility, which more
complicated models can allow for, that the exogenous variables may change
as time passes. To avoid notational confusion, we use $i$ to index observations.
In theory, duration is a nonnegative, continuous random variable. In practice,
however, $t_i$ is often reported as an integer number of weeks or months. When
it is always a small integer, a count data model like the ones discussed in
Section 11.5 may be appropriate. However, when $t_i$ can take on a large number
of integer values, it is conventional to model duration as being continuous.
Almost all of the literature deals with the continuous case.


Survivor Functions and Hazard Functions 


In practice, interest often centers not so much on how $t_i$ is related to $X_i$ but
rather on how the probability that a state will endure varies over the duration 
of the state. For example, we may be interested in seeing how the probability 
that someone will find a job changes as the length of time they have been 
unemployed increases. Before we can answer this sort of question, we need to 
discuss a few fundamental concepts. 
Suppose that how long a state endures is measured by $T$, a nonnegative,
continuous random variable with PDF $f(t)$ and CDF $F(t)$, where $t$ is a realization
of $T$. Then the survivor function is defined as


$$S(t) \equiv 1 - F(t).$$


This is the probability that a state which started at time $t = 0$ is still going
on at time $t$. The probability that it will end in any short period of time, say
the period from time $t$ to time $t + \Delta t$, is


$$\Pr(t \le T \le t + \Delta t) = F(t + \Delta t) - F(t). \tag{11.79}$$


This probability is unconditional. For many purposes, we may be interested
in the probability that a state will end between time $t$ and time $t + \Delta t$,
conditional on having reached time $t$ in the first place. This probability is


$$\Pr(t \le T \le t + \Delta t \mid T \ge t) = \frac{F(t + \Delta t) - F(t)}{S(t)}. \tag{11.80}$$


Since we are dealing with continuous time, it is natural to divide (11.79) and
(11.80) by $\Delta t$ and consider what happens as $\Delta t \to 0$. The limit of $1/\Delta t$
times (11.79) as $\Delta t \to 0$ is simply the PDF $f(t)$, and the limit of $1/\Delta t$ times
(11.80) is


$$h(t) \equiv \frac{f(t)}{S(t)} = \frac{f(t)}{1 - F(t)}. \tag{11.81}$$


The function $h(t)$ defined in (11.81) is called the hazard function. For many
purposes, it is more interesting to model the hazard function than to model
the survivor function directly.
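The limiting argument behind (11.81) can be checked numerically. The following Python sketch is only an illustration: the distribution $F(t) = 1 - \exp(-t^2)$ and all function names are our own assumptions, chosen because the hazard then has the simple closed form $2t$. It evaluates the conditional probability (11.80) divided by $\Delta t$ for shrinking $\Delta t$ and compares the result with $f(t)/S(t)$.

```python
import math


def survivor(cdf, t):
    """S(t) = 1 - F(t): probability that the state is still going on at t."""
    return 1.0 - cdf(t)


def conditional_exit_prob(cdf, t, dt):
    """Pr(t <= T <= t + dt | T >= t), as in (11.80)."""
    return (cdf(t + dt) - cdf(t)) / survivor(cdf, t)


def hazard(pdf, cdf, t):
    """h(t) = f(t) / S(t), as in (11.81)."""
    return pdf(t) / survivor(cdf, t)


# Illustrative distribution (an assumption, not taken from the text):
# F(t) = 1 - exp(-t^2), so that f(t) = 2 t exp(-t^2) and h(t) = 2 t.
def example_cdf(t):
    return 1.0 - math.exp(-t * t)


def example_pdf(t):
    return 2.0 * t * math.exp(-t * t)


if __name__ == "__main__":
    t = 1.5
    for dt in (0.1, 0.01, 0.001):
        # (11.80) divided by dt should approach the hazard as dt shrinks.
        print(dt, conditional_exit_prob(example_cdf, t, dt) / dt)
    print("hazard:", hazard(example_pdf, example_cdf, t))
```

The ratio gets closer to the hazard as $\Delta t$ shrinks, which is all that the definition (11.81) asserts.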


Functional Forms 


For a parametric model of duration, we need to specify a functional form for 
one of the functions F(t), S(t), f(t), or h(t), which then implies functional 
forms for the others. One of the simplest possible choices is the exponential 
distribution, which was discussed in Section 10.2. For this distribution, 


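$$f(t, \theta) = \theta e^{-\theta t}, \quad\text{and}\quad F(t, \theta) = 1 - e^{-\theta t}, \quad \theta > 0.$$


Therefore, the hazard function is


$$h(t) = \frac{f(t)}{S(t)} = \frac{\theta e^{-\theta t}}{e^{-\theta t}} = \theta.$$


Thus, if duration follows an exponential distribution, the hazard function is
simply a constant.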
Since the restriction that the hazard function is a constant is a very strong
one, the exponential distribution is rarely used in applied work. A much more
flexible functional form is provided by the Weibull distribution, which has two
parameters, $\theta$ and $\alpha$. For this distribution,


$$F(t, \theta, \alpha) = 1 - \exp\bigl(-(\theta t)^{\alpha}\bigr). \tag{11.82}$$


As readers are asked to show in Exercise 11.33, the survivor, density, and 
hazard functions for the Weibull distribution are as follows: 


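$$\begin{aligned}
S(t) &= \exp\bigl(-(\theta t)^{\alpha}\bigr);\\
f(t) &= \alpha\theta^{\alpha} t^{\alpha-1}\exp\bigl(-(\theta t)^{\alpha}\bigr);\\
h(t) &= \alpha\theta^{\alpha} t^{\alpha-1}.
\end{aligned} \tag{11.83}$$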


When $\alpha = 1$, it is easy to see that the Weibull distribution collapses to
the exponential, and the hazard is just a constant. For $\alpha < 1$, the hazard is
decreasing over time, and for $\alpha > 1$, the hazard is increasing. Hazard functions
of the former type are said to exhibit negative duration dependence, while
those of the latter type are said to exhibit positive duration dependence. In
the same way, a constant hazard is said to be duration independent.
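The three cases can be illustrated with a short Python sketch of the Weibull functions in (11.83). The function names and the parameter values used in the example are our own choices, made purely for illustration.

```python
import math


def weibull_survivor(t, theta, alpha):
    """S(t) = exp(-(theta t)^alpha), from (11.83)."""
    return math.exp(-((theta * t) ** alpha))


def weibull_density(t, theta, alpha):
    """f(t) = alpha theta^alpha t^(alpha-1) exp(-(theta t)^alpha)."""
    return (alpha * theta ** alpha * t ** (alpha - 1.0)
            * weibull_survivor(t, theta, alpha))


def weibull_hazard(t, theta, alpha):
    """h(t) = alpha theta^alpha t^(alpha-1)."""
    return alpha * theta ** alpha * t ** (alpha - 1.0)


def duration_dependence(alpha):
    """Classify the hazard by the sign of its slope in t."""
    if alpha < 1.0:
        return "negative"
    if alpha > 1.0:
        return "positive"
    return "none"


if __name__ == "__main__":
    theta = 0.5  # illustrative value
    for alpha in (0.5, 1.0, 2.0):
        hazards = [weibull_hazard(t, theta, alpha) for t in (1.0, 2.0, 4.0)]
        print(alpha, duration_dependence(alpha), hazards)
```

With $\alpha = 1$ the hazard equals $\theta$ at every $t$, which is the exponential case.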


Although the Weibull distribution is not nearly as restrictive as the exponen- 
tial, it does not allow for the possibility that the hazard may first increase 
and then decrease over time, which is something that is frequently observed 
in practice. Various other distributions do allow for this type of behavior. A 
particularly simple one is the lognormal distribution, which was discussed in 
Section 9.6. Suppose that $\log t$ is distributed as $N(\mu, \sigma^2)$. Then we have


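$$\begin{aligned}
F(t) &= \Phi\bigl(\tfrac{1}{\sigma}(\log t - \mu)\bigr),\\
S(t) &= 1 - \Phi\bigl(\tfrac{1}{\sigma}(\log t - \mu)\bigr) = \Phi\bigl(-\tfrac{1}{\sigma}(\log t - \mu)\bigr),\\
f(t) &= \frac{1}{\sigma t}\,\phi\bigl(\tfrac{1}{\sigma}(\log t - \mu)\bigr), \text{ and}\\
h(t) &= \frac{1}{\sigma t}\,\frac{\phi\bigl((\log t - \mu)/\sigma\bigr)}{\Phi\bigl(-(\log t - \mu)/\sigma\bigr)}.
\end{aligned}$$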


For this distribution, the hazard rises quite rapidly and then falls rather slowly. 
This behavior can be observed in Figure 11.4, which shows several hazard 
functions based on the exponential, Weibull, and lognormal distributions. 
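The rise and fall of the lognormal hazard can also be seen numerically. The sketch below is an illustration under our own assumptions: the values $\mu = 0$ and $\sigma = 1$, the evaluation grid, and the function names are not taken from the text. The standard normal CDF is computed from the error function in the Python standard library.

```python
import math


def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal CDF Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_hazard(t, mu, sigma):
    """h(t) = phi(z) / (sigma t Phi(-z)), with z = (log t - mu) / sigma."""
    z = (math.log(t) - mu) / sigma
    return std_normal_pdf(z) / (sigma * t * std_normal_cdf(-z))


def hazard_peak(mu, sigma, grid):
    """Grid point at which the hazard is largest (a crude search)."""
    return max(grid, key=lambda t: lognormal_hazard(t, mu, sigma))


if __name__ == "__main__":
    mu, sigma = 0.0, 1.0                        # illustrative values
    grid = [0.05 * k for k in range(1, 401)]    # t from 0.05 to 20
    for t in (0.1, 0.5, 1.0, 5.0, 10.0):
        print(t, lognormal_hazard(t, mu, sigma))
    print("peak near t =", hazard_peak(mu, sigma, grid))
```

For these values the largest hazard on the grid is attained at an interior point, so the hazard is neither monotonically increasing nor monotonically decreasing, in contrast to the Weibull case.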


Maximum Likelihood Estimation 


It is reasonably straightforward to estimate many duration models by maximum
likelihood. In the simplest case, the data consist of $n$ observations $t_i$ on
observed durations, each with an associated regressor vector $X_i$. Then the
loglikelihood function for $t$, the entire vector of observations, is just


$$\ell(t, \theta) = \sum_{i=1}^{n} \log f(t_i \mid X_i, \theta), \tag{11.84}$$


where $f(t_i \mid X_i, \theta)$ denotes the density of $t_i$ conditional on the data vector
$X_i$ for the parameter vector $\theta$. In many cases, it may be easier to write the
loglikelihood function as


$$\ell(t, \theta) = \sum_{i=1}^{n} \log h(t_i \mid X_i, \theta) + \sum_{i=1}^{n} \log S(t_i \mid X_i, \theta), \tag{11.85}$$


where $h(t_i \mid X_i, \theta)$ is the hazard function and $S(t_i \mid X_i, \theta)$ is the survivor
function. The equivalence of (11.84) and (11.85) is ensured by (11.81), in which
the hazard function was defined.


Figure 11.4 Various hazard functions


As with other models we have looked at in this chapter, it is convenient to let
the loglikelihood depend on explanatory variables through an index function.
As an example, suppose that duration follows a Weibull distribution, with
a parameter $\theta_i$ for observation $i$ that has the form of the exponential mean
function (11.48), so that $\theta_i = \exp(X_i\beta) > 0$. From (11.83) we see that the
hazard and survivor functions for observation $i$ are


$$\alpha\exp(\alpha X_i\beta)\,t_i^{\alpha-1} \quad\text{and}\quad \exp\bigl(-t_i^{\alpha}\exp(\alpha X_i\beta)\bigr),$$


respectively. In practice, it is simpler to absorb the factor of $\alpha$ into the
parameter vector $\beta$, so as to yield an exponent of just $X_i\beta$ in these expressions.
Then the loglikelihood function (11.85) becomes


$$\ell(t, \beta, \alpha) = n\log\alpha + \sum_{i=1}^{n} X_i\beta + (\alpha - 1)\sum_{i=1}^{n} \log t_i - \sum_{i=1}^{n} t_i^{\alpha}\exp(X_i\beta),$$


and ML estimates of the parameters $\alpha$ and $\beta$ are obtained by maximizing this
function in the usual way.
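As a check on the algebra, the following Python sketch evaluates the Weibull loglikelihood in two ways: from the closed-form expression just given, and by summing $\log h + \log S$ observation by observation as in (11.85). The data, parameter values, and function names are hypothetical and serve only to illustrate the calculation; no maximization is attempted.

```python
import math


def index(x_row, beta):
    """Index function X_i beta for one observation."""
    return sum(x * b for x, b in zip(x_row, beta))


def weibull_loglik(t, X, beta, alpha):
    """Closed form: n log(alpha) + sum X_i beta
    + (alpha - 1) sum log t_i - sum t_i^alpha exp(X_i beta)."""
    xb = [index(row, beta) for row in X]
    return (len(t) * math.log(alpha)
            + sum(xb)
            + (alpha - 1.0) * sum(math.log(ti) for ti in t)
            - sum(ti ** alpha * math.exp(v) for ti, v in zip(t, xb)))


def weibull_loglik_from_hazard(t, X, beta, alpha):
    """Same quantity built from (11.85): sum of log h_i plus log S_i."""
    total = 0.0
    for ti, row in zip(t, X):
        xb = index(row, beta)
        log_h = math.log(alpha) + xb + (alpha - 1.0) * math.log(ti)
        log_s = -(ti ** alpha) * math.exp(xb)
        total += log_h + log_s
    return total


if __name__ == "__main__":
    # Hypothetical durations and regressors (constant plus one variable).
    t = [1.5, 0.7, 3.2, 2.1]
    X = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.0], [1.0, 0.0]]
    beta, alpha = [-0.8, 0.3], 1.4
    print(weibull_loglik(t, X, beta, alpha))
    print(weibull_loglik_from_hazard(t, X, beta, alpha))
```

In applied work one would pass the negative of such a function to a numerical optimizer; that step is omitted here.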


In practice, many data sets contain observations for which $t_i$ is not actually
observed. For example, if we have a sample of people who entered unemployment
at various points in time, it is extremely likely that some people in the
sample were still unemployed when data collection ended. If we omit such
observations, we are effectively using a truncated data set, and we will therefore
obtain inconsistent estimates. However, if we include them but treat the
observed $t_i$ as if they were the lengths of completed spells of unemployment,
we will also obtain inconsistent estimates. In both cases, the inconsistency
occurs for essentially the same reasons as it does when we apply OLS to a
sample that has been truncated or censored; see Section 11.6.


If we are using ML estimation, it is easy enough to deal with duration data
that have been censored in this way, provided we know that censorship has
occurred. For ordinary, uncensored observations, the contribution to the
loglikelihood function is a contribution like those in (11.84) or (11.85). For
censored observations, where the observed $t_i$ is the duration of an incomplete
spell, it is the logarithm of the probability of censoring, which is the
probability that the duration exceeds $t_i$, that is, the log of the survivor function.
Therefore, if $U$ denotes the set of uncensored observations, the loglikelihood
function for the entire sample can be written as


$$\ell(t, \theta) = \sum_{i \in U} \log h(t_i \mid X_i, \theta) + \sum_{i=1}^{n} \log S(t_i \mid X_i, \theta). \tag{11.86}$$


Notice that uncensored observations contribute to both terms in (11.86), while 
censored observations contribute only to the second term. When there is no 
censoring, the same observations contribute to both terms, and (11.86) reduces 
to (11.85). 
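A small Python sketch may help to make (11.86) concrete. It assumes, purely for illustration, an exponential model with no regressors, so that $\log h(t) = \log\theta$ and $\log S(t) = -\theta t$. In that special case setting the derivative of (11.86) to zero gives $\hat\theta$ equal to the number of uncensored spells divided by the sum of all observed durations. The data and function names are hypothetical.

```python
import math


def censored_loglik(durations, censored, log_hazard, log_survivor):
    """Loglikelihood (11.86): every observation contributes log S(t_i);
    only uncensored observations also contribute log h(t_i)."""
    total = 0.0
    for t, is_censored in zip(durations, censored):
        if not is_censored:
            total += log_hazard(t)
        total += log_survivor(t)
    return total


def exponential_loglik(theta, durations, censored):
    """(11.86) for a constant hazard theta and no regressors."""
    return censored_loglik(
        durations, censored,
        log_hazard=lambda t: math.log(theta),
        log_survivor=lambda t: -theta * t,
    )


def exponential_mle(durations, censored):
    """Maximizer of exponential_loglik: completed spells / total time."""
    completed = sum(1 for c in censored if not c)
    return completed / sum(durations)


# Hypothetical data: two spells were still in progress when sampling ended.
durations = [2.0, 5.0, 1.0, 8.0, 8.0, 3.0]
censored = [False, False, False, True, True, False]

if __name__ == "__main__":
    theta_hat = exponential_mle(durations, censored)
    # Two flawed alternatives discussed in the text:
    kept = [t for t, c in zip(durations, censored) if not c]
    theta_drop = len(kept) / sum(kept)             # omit censored spells
    theta_naive = len(durations) / sum(durations)  # treat them as completed
    for name, th in (("ML", theta_hat), ("drop", theta_drop),
                     ("naive", theta_naive)):
        print(name, th, exponential_loglik(th, durations, censored))
```

When no observation is censored, `censored_loglik` adds both terms for every observation and therefore coincides with (11.85).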


Proportional Hazard Models 


One class of models that is quite widely used is the class of proportional hazard
models, originally proposed by Cox (1972), in which the hazard function for
the $i$th economic agent is given by


$$h(X_i, t) = g_1(X_i)\,g_2(t), \tag{11.87}$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 




for various specifications of the functions $g_1(X_i)$ and $g_2(t)$. The latter is called
the baseline hazard function. An implication of (11.87) is that the ratio of
the hazards for any two agents, say the ones indexed by $i$ and $j$, depends on
the regressors but does not depend on $t$. This ratio is


$$\frac{h(X_i, t)}{h(X_j, t)} = \frac{g_1(X_i)\,g_2(t)}{g_1(X_j)\,g_2(t)} = \frac{g_1(X_i)}{g_1(X_j)}.$$


Thus the ratio of the conditional probability that agent i will exit the state to 
the probability that agent j will do so is constrained to be the same for all t. 
This makes proportional hazard models econometrically convenient, but they 
do impose fairly strong restrictions on behavior. 
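The fact that the hazard ratio does not depend on $t$ is easy to verify numerically. The Python sketch below assumes, for illustration only, the specification $g_1(X_i) = \exp(X_i\beta)$ together with a Weibull-type baseline hazard $g_2(t) = \alpha t^{\alpha-1}$; the regressor values, parameter values, and function names are hypothetical.

```python
import math


def g1(x_row, beta):
    """Agent-specific factor exp(X_i beta)."""
    return math.exp(sum(x * b for x, b in zip(x_row, beta)))


def baseline_hazard(t, alpha):
    """Weibull-type baseline hazard alpha t^(alpha - 1)."""
    return alpha * t ** (alpha - 1.0)


def hazard(x_row, t, beta, alpha):
    """Proportional hazard (11.87): g1(X_i) times g2(t)."""
    return g1(x_row, beta) * baseline_hazard(t, alpha)


def hazard_ratio(x_i, x_j, t, beta, alpha):
    """Ratio of the hazards of agents i and j at duration t."""
    return hazard(x_i, t, beta, alpha) / hazard(x_j, t, beta, alpha)


if __name__ == "__main__":
    beta, alpha = [-1.0, 0.4], 1.6        # illustrative parameter values
    x_i, x_j = [1.0, 2.0], [1.0, 0.5]     # hypothetical regressor rows
    for t in (0.5, 2.0, 7.0):
        # The same number is printed for every t.
        print(t, hazard_ratio(x_i, x_j, t, beta, alpha))
```

Each hazard changes with $t$, but the baseline factor cancels in the ratio, leaving only $g_1(X_i)/g_1(X_j)$.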


Both the exponential and Weibull distributions lead to proportional hazard
models. As we have already seen, a natural specification of $g_1(X_i)$ for these
models is $\exp(X_i\beta)$. For the exponential distribution, the baseline hazard
function is just 1, and for the Weibull distribution it is $\alpha t^{\alpha-1}$.


One attractive feature of proportional hazard models is that it is possible to
obtain consistent estimates of the parameters of the function $g_1(X_i)$, without
estimating those of $g_2(t)$ at all, by using a method called partial likelihood,
which we will not attempt to describe; see Cox and Oakes (1984) or Lancaster
(1990). The baseline hazard function $g_2(t)$ can then be estimated in various
ways, some of which do not require us to specify its functional form.


Complications 


The class of duration models that we have discussed is quite limited. It does 
not allow the exogenous variables to change over time, and it does not allow 
for any individual heterogeneity, that is, variation in the hazard function 
across agents. The latter has serious implications for econometric inference. 
Suppose, for simplicity, that there are two types of agent, each with a constant 
hazard, which is twice as high for agents of type H as for those of type L. If 
we estimate a duration model for all agents together, we will observe negative 
duration dependence, because the type H agents will exit the state more 
rapidly than the type L agents, and the ratio of type H to type L agents will 
decline as duration increases. 
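The two-type example can be worked through numerically. In the Python sketch below, type L agents have constant hazard $\lambda$ and type H agents have constant hazard $2\lambda$; the value of $\lambda$, the initial share of type H agents, and the function names are our own illustrative assumptions. The aggregate hazard is the aggregate density divided by the aggregate survivor function, as in (11.81).

```python
import math


def aggregate_survivor(t, lam, share_h):
    """Survivor function of the mixed population at duration t."""
    return (share_h * math.exp(-2.0 * lam * t)
            + (1.0 - share_h) * math.exp(-lam * t))


def aggregate_hazard(t, lam, share_h):
    """Aggregate density divided by aggregate survivor function."""
    density = (share_h * 2.0 * lam * math.exp(-2.0 * lam * t)
               + (1.0 - share_h) * lam * math.exp(-lam * t))
    return density / aggregate_survivor(t, lam, share_h)


def surviving_share_h(t, lam, share_h):
    """Fraction of agents still in the state at t who are of type H."""
    return (share_h * math.exp(-2.0 * lam * t)
            / aggregate_survivor(t, lam, share_h))


if __name__ == "__main__":
    lam, share_h = 0.1, 0.5   # illustrative values
    for t in (0.0, 5.0, 10.0, 20.0, 40.0):
        print(t, aggregate_hazard(t, lam, share_h),
              surviving_share_h(t, lam, share_h))
```

Although each individual hazard is constant, the aggregate hazard declines from $\lambda(1 + p)$, where $p$ is the initial share of type H agents, toward $\lambda$ as the surviving population comes to consist mostly of type L agents.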


There has been a great deal of work on duration models during the past 
two decades, and there are numerous models that allow for time-varying ex- 
planatory variables and/or individual heterogeneity. Classic references are 
Heckman and Singer (1984), Kiefer (1988), and Lancaster (1990). More re- 
cent work is discussed in Neumann (1999), Gouriéroux and Jasiak (2001), and 
van den Berg (2001). 
11.9 Final Remarks


This chapter has dealt with a large number of types of dependent variable for 
which ordinary regression models are not appropriate: binary dependent vari- 
ables (Sections 11.2 and 11.3); discrete dependent variables that can take on 
more than two values, which may or may not be ordered (Section 11.4); count 
data (Section 11.5); limited dependent variables, which may be either cen- 
sored or truncated (Section 11.6); dependent variables where the observations 
included in the sample have been determined endogenously (Section 11.7); 
and duration data (Section 11.8). In most cases, we have made strong dis- 
tributional assumptions and relied on maximum likelihood estimation. This 
is generally the easiest way to proceed, but it can lead to seriously mislead- 
ing results if the assumptions are false. It is therefore important that the 
specification of these models be tested carefully. 


11.10 Exercises 


11.1 Consider the contribution made by observation $t$ to the loglikelihood
function (11.09) for a binary response model. Show that this contribution is
globally concave with respect to $\beta$ if the function $F$ is such that $F(-x) = 1 - F(x)$,
and if it, its derivative $f$, and its second derivative $f'$ satisfy the condition


$$f'(x)F(x) - f^2(x) < 0 \tag{11.88}$$


for all real finite $x$.


Show that condition (11.88) is satisfied by both the logistic function $\Lambda(\cdot)$,
defined in (11.07), and the standard normal CDF $\Phi(\cdot)$.


11.2 Prove that, for the logit model, the likelihood equations (11.10) reduce to


$$\sum_{t=1}^{n} x_{ti}\bigl(y_t - \Lambda(X_t\beta)\bigr) = 0, \quad i = 1, \ldots, k.$$


11.3 Show that the efficient GMM estimating equations (9.82), when applied to the 
binary response model specified by (11.01), are equivalent to the likelihood 
equations (11.10). 


11.4 If $F_1(\cdot)$ and $F_2(\cdot)$ are two CDFs defined on the real line, show that any
convex combination $(1 - \alpha)F_1(\cdot) + \alpha F_2(\cdot)$ of them is also a properly defined
CDF. Use this fact to construct a model that nests the logit model for which
$\Pr(y_t = 1) = \Lambda(X_t\beta)$ and the probit model for which $\Pr(y_t = 1) = \Phi(X_t\beta)$
with just one additional parameter.


11.5 Consider the latent variable model

$$y_t^{\circ} = \beta_1 + \beta_2 x_t + u_t, \quad u_t \sim N(0, 1),$$

$$y_t = 1 \text{ if } y_t^{\circ} > 0, \quad y_t = 0 \text{ if } y_t^{\circ} \le 0.$$

Suppose that $x_t \sim N(0, 1)$. Generate 500 samples of 20 observations on
$(x_t, y_t)$ pairs, 100 assuming that $\beta_1 = 0$ and $\beta_2 = 1$, 100 assuming that
$\beta_1 = 1$ and $\beta_2 = 1$, 100 assuming that $\beta_1 = -1$ and $\beta_2 = 1$, 100 assuming
that $\beta_1 = 0$ and $\beta_2 = 2$, 100 assuming that $\beta_1 = 0$ and $\beta_2 = -2$, and 100
assuming that $\beta_1 = 0$ and $\beta_2 = 3$. For each of the 500 samples, attempt to
estimate a probit model. In each of the five cases, what proportion of the
time does the estimation fail because of perfect classifiers? Explain why there
were more failures in some cases than in others.


Repeat this exercise for five sets of 100 samples of size 40, with the same 
parameter values. What do you conclude about the effect of sample size on 
the perfect classifier problem? 


11.6 Suppose that there is quasi-complete separation of the data used to estimate
the binary response model (11.01), with a transformation function $F$ such
that $F(-x) = 1 - F(x)$ for all real $x$, and a separating hyperplane defined
by the parameter vector $\beta^{\bullet}$. Show that the upper bound of the loglikelihood
function (11.09) is equal to $-n_b \log 2$, where $n_b$ is the number of observations
for which $X_t\beta^{\bullet} = 0$.


11.7 The contribution to the loglikelihood function (11.09) made by observation $t$
is $y_t \log F(X_t\beta) + (1 - y_t)\log\bigl(1 - F(X_t\beta)\bigr)$. First, find $G_{ti}$, the derivative
of this contribution with respect to $\beta_i$. Next, show that the expectation of
$G_{ti}$ is zero when it is evaluated at the true $\beta$. Then obtain a typical element
of the asymptotic information matrix by using the fact that it is equal to
$\lim_{n\to\infty} n^{-1}\sum_{t=1}^{n} E(G_{ti}G_{tj})$. Finally, show that the asymptotic covariance
matrix (11.15) is equal to the inverse of this asymptotic information matrix.


11.8 Calculate the Hessian matrix corresponding to the loglikelihood function
(11.09). Then use the fact that minus the expectation of the asymptotic 
Hessian is equal to the asymptotic information matrix to obtain the same 
result for the latter that you obtained in the previous exercise. 


11.9 Plot $\Upsilon_t(\beta)$, which is defined in equation (11.16), as a function of $X_t\beta$ for
both the logit and probit models. For the logit model only, prove that $\Upsilon_t(\beta)$
achieves its maximum value when $X_t\beta = 0$ and declines monotonically as
$|X_t\beta|$ increases.


11.10 The file participation.data, which is taken from Gerfin (1996), contains data
for 872 Swiss women who may or may not participate in the labor force. The
variables in the file are:


$y_t$   Labor force participation variable (0 or 1).
$I_t$   Log of nonlabor income.
$A_t$   Age in decades (years divided by 10).
$E_t$   Education in years.
$nu_t$  Number of children under 7 years of age.
$no_t$  Number of children over 7 years of age.
$F_t$   Citizenship dummy variable (1 if not Swiss).


The dependent variable is $y_t$. For the standard specification, the regressors
are all of the other variables, plus $A_t^2$. Estimate the standard specification as
both a probit and a logit model. Is there any reason to prefer one of these
two models?
11.11 For the probit model estimated in Exercise 11.10, obtain at least three sensible
sets of standard error estimates. If possible, these should include ones based
on the Hessian, ones based on the OPG estimator (10.44), and ones based on
the information matrix estimator (11.18). You may make use of the BRMR,
regression (11.20), and/or the OPG regression (10.72), if appropriate.


11.12 Test the hypothesis that the probit model estimated in Exercise 11.10 should
include two additional regressors, namely, the squares of $nu_t$ and $no_t$. Do this
in three different ways, by calculating an LR statistic and two LM statistics
based on the OPG and BRMR regressions.


11.13 Use the BRMR (11.30) to test the specification of the probit model estimated
in Exercise 11.10. Then use the BRMR (11.26) to test for heteroskedasticity,
where $Z_t$ consists of all the regressors except the constant term.


11.14 Show, by use of l'Hôpital's rule or otherwise, that the two results in (11.29)
hold for all functions $\tau(\cdot)$ which satisfy conditions (11.28).


11.15 For the probit model estimated in Exercise 11.10, the estimated probability
that $y_t = 1$ for observation $t$ is $\Phi(X_t\hat\beta)$. Compute this estimated probability
for every observation, and also compute two confidence intervals at the .95
level for the actual probabilities. Both confidence intervals should be based
on the covariance matrix estimator (11.18). One of them should use the delta
method (Section 5.6), and the other should be obtained by transforming the
end points of a confidence interval for the index function. Compare the two
intervals for the observations numbered 2, 63, and 311 in the sample. Are
both intervals symmetric about the estimated probability? Which of them
provides more reasonable answers?


11.16 Consider the expression


$$-\log\Bigl(\sum_{j=0}^{J} \exp(W_{tj}\beta^{j})\Bigr), \tag{11.89}$$


which appears in the loglikelihood function (11.35) of the multinomial logit
model. Let the vector $\beta^{j}$ have $k_j$ components, let $k \equiv k_0 + \ldots + k_J$, and let
$\beta \equiv [\beta^{0} \,\vdots\, \ldots \,\vdots\, \beta^{J}]$. The $k \times k$ Hessian matrix $H$ of (11.89) with respect to $\beta$
can be partitioned into blocks of dimension $k_i \times k_j$, $i = 0, \ldots, J$, $j = 0, \ldots, J$,
containing the second-order partial derivatives of (11.89) with respect to an
element of $\beta^{i}$ and an element of $\beta^{j}$. Show that, for $i \ne j$, the $(i, j)$ block can
be written as

$$p_i p_j W_{ti}^{\top} W_{tj},$$

where $p_i \equiv \exp(W_{ti}\beta^{i}) \big/ \bigl(\sum_{j=0}^{J} \exp(W_{tj}\beta^{j})\bigr)$ is the probability ascribed to
choice $i$ by the multinomial logit model. Then show that the diagonal
$(i, i)$ block can be written as


$$-p_i(1 - p_i)\, W_{ti}^{\top} W_{ti}.$$


Let the $k$-vector $a$ be partitioned conformably with the above partitioning
of the Hessian $H$, so that we can write $a = [a_0 \,\vdots\, \ldots \,\vdots\, a_J]$, where each of the
vectors $a_j$ has $k_j$ components for $j = 0, \ldots, J$. Show that the quadratic form
$a^{\top} H a$ is equal to


$$\Bigl(\sum_{j=0}^{J} p_j w_j\Bigr)^{2} - \sum_{j=0}^{J} p_j w_j^{2}, \tag{11.90}$$


where the scalar product $w_j$ is defined as $W_{tj} a_j$.


Show that expression (11.90) is nonpositive, and explain why this result shows
that the multinomial logit loglikelihood function (11.35) is globally concave.


11.17 Show that the nested logit model reduces to the multinomial logit model if
$\theta_i = 1$ for all $i = 1, \ldots, m$. Then show that it also does so if all the subsets $A_i$
used to define the former model are singletons.


11.18 Show that the expectation of the Hessian of the loglikelihood function (11.41),
evaluated at the parameter vector $\theta$, is equal to the negative of the $k \times k$ matrix


$$I(\theta) \equiv \sum_{t=1}^{n} \sum_{j=0}^{J} \frac{1}{\Pi_{tj}(\theta)}\, T_{tj}^{\top}(\theta)\, T_{tj}(\theta), \tag{11.91}$$


where $T_{tj}(\theta)$ is the $1 \times k$ vector of partial derivatives of $\Pi_{tj}(\theta)$ with respect to
the components of $\theta$. Demonstrate that (11.91) can also be computed using
the outer product of the gradient definition of the information matrix.


Use the above result to show that the matrix of sums of squares and
cross-products of the regressors of the DCAR, regression (11.42), evaluated at $\theta$,
is $I(\theta)$. Show further that $1/s^2$ times the estimated OLS covariance matrix
from (11.42) is an asymptotically valid estimate of the covariance matrix of
the MLE $\hat\theta$ if the artificial variables are evaluated at $\hat\theta$.


11.19 Let the one-step estimator $\grave\theta$ be defined as usual for the discrete choice
artificial regression (11.42) evaluated at a root-$n$ consistent estimator $\acute\theta$ as
$\grave\theta = \acute\theta + \acute b$, where $\acute b$ is the vector of OLS parameter estimates from (11.42).
Show that $\grave\theta$ is asymptotically equivalent to the MLE $\hat\theta$.


11.20 Consider the binary choice model characterized by the probabilities (11.01).
Both the BRMR (11.20) and the DCAR (11.42) with $J = 1$ apply to this
model, but the two artificial regressions are obviously different, since the
BRMR has $n$ artificial observations when the sample size is $n$, while the DCAR
has $2n$. Show that the two artificial regressions are nevertheless equivalent, in
the sense that all scalar products of corresponding pairs of artificial variables,
regressand or regressor, are identical for the two regressions.


11.21 In terms of the notation of the DCAR, regression (11.42), the probability $\Pi_{tj}$
that $y_t = j$, $j = 0, \ldots, J$, for the nested logit model is given by expression
(11.40). Show that, if the index $i(j)$ is such that $j \in A_{i(j)}$, the partial
derivative of $\Pi_{tj}$ with respect to $\theta_i$, evaluated at $\theta_k = 1$ for $k = 1, \ldots, m$,
where $m$ is the number of subsets $A_k$, is

$$\frac{\partial \Pi_{tj}}{\partial \theta_i} = \Pi_{tj}\Bigl(\delta_{i\,i(j)}\, v_{tj} - \sum_{l \in A_i} \Pi_{tl}\, v_{tl}\Bigr).$$


Here $v_{tj} \equiv -W_{tj}\beta^{j} + h_{t\,i(j)}$, where $h_{ti}$ denotes the inclusive value (11.39) of
subset $A_i$, and $\delta_{ij}$ is the Kronecker delta.
When $\theta_k = 1$, $k = 1, \ldots, m$, the nested logit probabilities reduce to the
multinomial logit probabilities (11.34). Show that, if the $\Pi_{tj}$ are given by (11.34),
then the vector of partial derivatives of $\Pi_{tj}$ with respect to the components
of $\beta^{l}$ is $\Pi_{tj} W_{tl}(\delta_{jl} - \Pi_{tl})$.


11.22 Explain how to use the DCAR (11.42) to test the IIA assumption for the
conditional logit model (11.36). This involves testing it against the nested
logit model (11.40) with the $\beta^{j}$ constrained to be the same. Do this for the
special case in which $J = 2$, $A_1 = \{0, 1\}$, $A_2 = \{2\}$. Hint: Use the results
proved in the preceding exercise.


11.23 Using the fact that the infinite series expansion of the exponential function,
convergent for all real $z$, is


$$\exp z = \sum_{n=0}^{\infty} \frac{z^{n}}{n!},$$


where by convention we define $0! = 1$, show that $\sum_{y=0}^{\infty} e^{-\lambda}\lambda^{y}/y! = 1$, and
that therefore the Poisson distribution defined by (11.58) is well defined on
the nonnegative integers. Then show that the expectation and variance of a
random variable $Y$ that follows the Poisson distribution are both equal to $\lambda$.


11.24 Let the $n$th uncentered moment of the Poisson distribution with parameter $\lambda$
be denoted by $M_n(\lambda)$. Show that these moments can be generated by the
recurrence $M_{n+1}(\lambda) = \lambda\bigl(M_n(\lambda) + M_n'(\lambda)\bigr)$, where $M_n'(\lambda)$ is the derivative of
$M_n(\lambda)$. Using this result, show that the third and fourth central moments of
the Poisson distribution are $\lambda$ and $\lambda + 3\lambda^{2}$, respectively.


11.25 Explain precisely how you would use the artificial regression (11.55) to test the
hypothesis that $\beta_2 = 0$ in the Poisson regression model for which $\lambda_t(\beta) =
\exp(X_{t1}\beta_1 + X_{t2}\beta_2)$. Here $\beta_1$ is a $k_1$-vector and $\beta_2$ is a $k_2$-vector, with
$k = k_1 + k_2$. Consider two cases, one in which the model is estimated subject
to the restriction and one in which it is estimated unrestrictedly.


11.26 Suppose that $y_t$ is a count variable, with conditional mean $E(y_t) = \exp(X_t\beta)$
and conditional variance $E\bigl(y_t - \exp(X_t\beta)\bigr)^{2} = \gamma^{2}\exp(X_t\beta)$. Show that ML
estimates of $\beta$ under the incorrect assumption that $y_t$ is generated by a Poisson
regression model with mean $\exp(X_t\beta)$ will be asymptotically efficient
in this case. Also show that the OLS covariance matrix from the artificial
regression (11.55) will be asymptotically valid.


11.27 Suppose that $y_t$ is a count variable with conditional mean $E(y_t) = \exp(X_t\beta)$
and unknown conditional variance. Show that, if the artificial regression
(11.55) is evaluated at the ML estimates for a Poisson regression model which
specifies the conditional mean correctly, the HCCME $\mathrm{HC}_0$ for that artificial
regression will be numerically equal to expression (11.65), which is an
asymptotically valid covariance matrix estimator in this case.


11.28 The file count.data, which is taken from Gurmu (1997), contains data for 485
household heads who may or may not have visited a doctor during a certain
period of time. The variables in the file are:


$y_t$   Number of doctor visits (a nonnegative integer).
$C_t$   Number of children in the household.
$A_t$   A measure of access to health care.
$H_t$   A measure of health status.


Using these data, obtain ML estimates of a Poisson regression model to explain
the variable $y_t$, where


$$\lambda_t(\beta) = \exp(\beta_1 + \beta_2 C_t + \beta_3 A_t + \beta_4 H_t).$$


In addition to the estimates of the parameters, report three different standard
errors. One of these should be based on the inverse of the information matrix,
which is valid only when the model is correctly specified. The other two
should be computed using the artificial regression (11.55). One of them should
be valid under the assumption that the conditional variance is proportional
to $\lambda_t(\beta)$, and the other should be valid whenever the conditional mean is
specified correctly. Can you explain the differences among the three sets of
standard errors?


Test the model for overdispersion in two different ways. One test should be
based on the OPG regression, and the other should be based on the testing
regression (11.60). Note that this model is not the one actually estimated in
Gurmu (1997).


11.29 Consider the latent variable model

$$y_t^{\circ} = X_t\beta + u_t, \quad u_t \sim \mathrm{NID}(0, \sigma^{2}), \tag{11.92}$$

where $y_t = y_t^{\circ}$ whenever $y_t^{\circ} < y^{\max}$ and is not observed otherwise. Write
down the loglikelihood function for a sample of $n$ observations on $y_t$.


11.30 As in the previous question, suppose that $y_t^{\circ}$ is given by (11.92). Assume that
$y_t = y_t^{\circ}$ whenever $y^{\min} < y_t^{\circ} < y^{\max}$ and is not observed otherwise. Write
down the loglikelihood function for a sample of $n$ observations on $y_t$.


11.31 Suppose that $y_t^{\circ} = X_t\beta + u_t$ with $u_t \sim \mathrm{NID}(0, \sigma^{2})$. Suppose further that
$y_t = y_t^{\circ}$ if $y_t^{\circ} < y_t^{c}$, and $y_t = y_t^{c}$ otherwise, where $y_t^{c}$ is the known value
at which censoring occurs for observation $t$. Write down the loglikelihood
function for this model.


11.32 Let $z$ be distributed as $N(0, 1)$. Show that $E(z \mid z < x) = -\phi(x)/\Phi(x)$,
where $\Phi$ and $\phi$ are, respectively, the CDF and PDF of the standard normal
distribution. Then show that $E(z \mid z > x) = \phi(x)/\Phi(-x) = \phi(-x)/\Phi(-x)$.
The second result explains why the inverse Mills ratio appears in (11.77).


11.33 Starting from expression (11.82) for the CDF of the Weibull distribution,
show that the survivor function, the PDF, and the hazard function are as
given in (11.83).
12.1 Introduction


Up to this point, almost all the models we have discussed have involved just 
one equation. In most cases, there has been only one equation because there 
has been only one dependent variable. Even in the few cases in which there 
were several dependent variables, interest centered on just one of them. For 
example, in the case of the simultaneous equations model that was discussed 
in Chapter 8, we chose to estimate just one structural equation at a time. 


In this chapter, we discuss models which jointly determine the values of two or 
more dependent variables using two or more equations. Such models are called 
multivariate because they attempt to explain multiple dependent variables. 
As we will see, the class of multivariate models is considerably larger than 
the class of simultaneous equations models. Every simultaneous equations 
model is a multivariate model, but many interesting multivariate models are 
not simultaneous equations models. 


In the next section, which is quite long, we provide a detailed discussion of 
GLS, feasible GLS, and ML estimation of systems of linear regressions. Then, 
in Section 12.3, we discuss the estimation of systems of nonlinear equations 
which may involve cross-equation restrictions but do not involve simultaneity. 
Next, in Section 12.4, we provide a much more detailed treatment of the linear 
simultaneous equations model than we did in Chapter 8. We approach it from 
the point of view of GMM estimation, which leads to the well-known 3SLS 
estimator. In Section 12.5, we discuss the application of maximum likelihood 
to this model. Finally, in Section 12.6, we briefly discuss some of the methods 
for estimating nonlinear simultaneous equations models. 


12.2 Seemingly Unrelated Linear Regressions 


The multivariate linear regression model was investigated by Zellner (1962), 
who called it the seemingly unrelated regressions model. An SUR system, as 
such a model is often called, involves n observations on each of g dependent 
variables. In principle, these could be any set of variables measured at the 
same points in time or for the same cross-section. In practice, however, the 
dependent variables are often quite similar to each other. For example, in the 
time-series context, each of them might be the output of a different industry
or the inflation rate for a different country. In view of this, it might seem more 
appropriate to speak of “seemingly related regressions,” but the terminology 
is too well-established to change. 


We suppose that there are $g$ dependent variables indexed by $i$. Let $y_i$ denote
the $n$-vector of observations on the $i$th dependent variable, $X_i$ denote the
$n \times k_i$ matrix of regressors for the $i$th equation, $\beta_i$ denote the $k_i$-vector of
parameters, and $u_i$ denote the $n$-vector of error terms. Then the $i$th equation
of a multivariate linear regression model may be written as


$$y_i = X_i\beta_i + u_i, \quad E(u_i u_i^{\top}) = \sigma_{ii} I_n, \tag{12.01}$$


where $I_n$ is the $n \times n$ identity matrix. The reason we use $\sigma_{ii}$ to denote the
variance of the error terms will become apparent shortly. In most cases, some
columns are common to two or more of the matrices $X_i$. For instance, if every
equation has a constant term, each of the $X_i$ must contain a column of 1s.


Since equation (12.01) is just a linear regression model with IID errors, we can
perfectly well estimate it by ordinary least squares if we assume that all the
columns of $X_i$ are either exogenous or predetermined. If we do this, however,
we ignore the possibility that the error terms may be correlated across the
equations of the system. In many cases, it is plausible that $u_{ti}$, the error
term for observation $t$ of equation $i$, should be correlated with $u_{tj}$, the error
term for observation $t$ of equation $j$. For example, we might expect that a
macroeconomic shock which affects the inflation rate in one country would
simultaneously affect the inflation rate in other countries as well.


To allow for this possibility, the assumption that is usually made about the 
error terms in the model (12.01) is 


$$E(u_{ti}u_{tj}) = \sigma_{ij} \text{ for all } t, \quad E(u_{ti}u_{sj}) = 0 \text{ for all } t \ne s, \tag{12.02}$$


where $\sigma_{ij}$ is the $ij$th element of the $g \times g$ positive definite matrix $\Sigma$. This
assumption allows all the $u_{ti}$ for a given $t$ to be correlated, but it specifies
that they are homoskedastic and independent across $t$. The matrix $\Sigma$ is called
the contemporaneous covariance matrix, a term inspired by the time-series
context. The error terms $u_{ti}$ may be arranged into an $n \times g$ matrix $U$, of
which a typical row is the $1 \times g$ vector $U_t$. It then follows from (12.02) that


E(U,'U;) = —E(U'U) = X. (12.03) 


1 
n 
If we combine equations (12.01), for i = 1,...,g, with assumption (12.02), we 
obtain the classical SUR model. 


We have not yet made any sort of exogeneity or predeterminedness assumption.
A rather strong assumption is that $E(U \mid X) = O$, where $X$ is an $n \times l$
matrix with full rank, the set of columns of which is the union of all the linearly
independent columns of all the matrices $X_i$. Thus $l$ is the total number of
variables that appear in any of the $X_i$ matrices. This exogeneity assumption,
which is the analog of assumption (3.08) for univariate regression models, is
undoubtedly too strong in many cases. A considerably weaker assumption is
that $E(U_t \mid X_t) = 0$, where $X_t$ is the $t$th row of $X$. This is the analog of the
predeterminedness assumption (3.10) for univariate regression models. The
results that we will state are valid under either of these assumptions.


Precisely how we want to estimate a linear SUR system depends on what
further assumptions we make about the matrix $\Sigma$ and the distribution of
the error terms. In the simplest case, $\Sigma$ is assumed to be known, at least
up to a scalar factor, and the distribution of the error terms is unspecified.
The appropriate estimation method is then generalized least squares. If we
relax the assumption that $\Sigma$ is known, then we need to use feasible GLS. If
we continue to assume that $\Sigma$ is unknown but impose the assumption that
the error terms are normally distributed, then we may want to use maximum
likelihood, which is generally consistent even when the normality assumption
is false. In practice, both feasible GLS and ML are widely used.


GLS Estimation with a Known Covariance Matrix 


Even though it is rarely a realistic assumption, we begin by assuming that the
contemporaneous covariance matrix $\Sigma$ of a linear SUR system is known, and
we consider how to estimate the model by GLS. Once we have seen how to
do so, it will be easy to see how to estimate such a model by other methods.
The trick is to convert a system of $g$ linear equations and $n$ observations into
what looks like a single equation with $gn$ observations and a known $gn \times gn$
covariance matrix that depends on $\Sigma$.


By making appropriate definitions, we can write the entire SUR system of 
which a typical equation is (12.01) as 


$$y_\bullet = X_\bullet\beta_\bullet + u_\bullet. \tag{12.04}$$


Here $y_\bullet$ is a $gn$-vector consisting of the $n$-vectors $y_1$ through $y_g$ stacked
vertically, and $u_\bullet$ is similarly the vector of $u_1$ through $u_g$ stacked vertically.
The matrix $X_\bullet$ is a $gn \times k$ block-diagonal matrix, where $k$ is equal to $\sum_{i=1}^g k_i$.
The diagonal blocks are the matrices $X_1$ through $X_g$. Thus we have


$$X_\bullet \equiv \begin{bmatrix} X_1 & O & \cdots & O \\ O & X_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & X_g \end{bmatrix}, \tag{12.05}$$


where each of the $O$ blocks has $n$ rows and as many columns as the $X_i$ block
that it shares those columns with. To be conformable with $X_\bullet$, the vector $\beta_\bullet$
is a $k$-vector consisting of the vectors $\beta_1$ through $\beta_g$ stacked vertically.


From the above definitions and the rules for matrix multiplication, it is not
difficult to see that


$$\begin{bmatrix} y_1 \\ \vdots \\ y_g \end{bmatrix} = y_\bullet = X_\bullet\beta_\bullet + u_\bullet = \begin{bmatrix} X_1\beta_1 \\ \vdots \\ X_g\beta_g \end{bmatrix} + \begin{bmatrix} u_1 \\ \vdots \\ u_g \end{bmatrix}.$$


Thus it is apparent that the single equation (12.04) is precisely what we
obtain by stacking the equations (12.01) vertically, for $i = 1,\ldots,g$. Using the
notation of (12.04), we can write the OLS estimator for the entire system very
compactly as


$$\hat\beta_\bullet^{\rm OLS} = (X_\bullet^\top X_\bullet)^{-1}X_\bullet^\top y_\bullet, \tag{12.06}$$


as readers are asked to verify in Exercise 12.4. But the assumptions we have
made about $u_\bullet$ imply that this estimator is not efficient.


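The next step is to figure out the covariance matrix of the vector $u_\bullet$. Since the
error terms are assumed to have mean zero, this matrix is just the expectation
of the matrix $u_\bullet u_\bullet^\top$. Under assumption (12.02), we find that


$$E(u_\bullet u_\bullet^\top) = \begin{bmatrix} E(u_1u_1^\top) & \cdots & E(u_1u_g^\top) \\ \vdots & \ddots & \vdots \\ E(u_gu_1^\top) & \cdots & E(u_gu_g^\top) \end{bmatrix} = \begin{bmatrix} \sigma_{11}I_n & \cdots & \sigma_{1g}I_n \\ \vdots & \ddots & \vdots \\ \sigma_{g1}I_n & \cdots & \sigma_{gg}I_n \end{bmatrix} \equiv \Sigma_\bullet. \tag{12.07}$$


Here, $\Sigma_\bullet$ is a symmetric $gn \times gn$ covariance matrix. In Exercise 12.1, readers
are asked to show that $\Sigma_\bullet$ is positive definite whenever $\Sigma$ is.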


The matrix $\Sigma_\bullet$ can be written more compactly as $\Sigma_\bullet \equiv \Sigma \otimes I_n$ if we use
the Kronecker product symbol $\otimes$. The Kronecker product $A \otimes B$ of a $p \times q$
matrix $A$ and an $r \times s$ matrix $B$ is a $pr \times qs$ matrix consisting of $pq$ blocks,
laid out in the pattern of the elements of $A$. For $i = 1,\ldots,p$ and $j = 1,\ldots,q$,
the $ij$th block of the Kronecker product is the $r \times s$ matrix $a_{ij}B$, where $a_{ij}$ is
the $ij$th element of $A$. As can be seen from (12.07), that is exactly how the
blocks of $\Sigma_\bullet$ are defined in terms of $I_n$ and the elements of $\Sigma$.


Kronecker products have a number of useful properties. In particular, if A, 
B, C, and D are conformable matrices, then the following relationships hold: 


$$\begin{aligned} (A \otimes B)^\top &= A^\top \otimes B^\top, \\ (A \otimes B)(C \otimes D) &= (AC) \otimes (BD), \text{ and} \\ (A \otimes B)^{-1} &= A^{-1} \otimes B^{-1}. \end{aligned} \tag{12.08}$$


Of course, the last line of (12.08) can be true only for nonsingular, square
matrices $A$ and $B$. The Kronecker product is not commutative, by which we
mean that $A \otimes B$ and $B \otimes A$ are different matrices. However, the elements
of these two products are the same; they are just laid out differently. In fact,
it can be shown that $B \otimes A$ can be obtained from $A \otimes B$ by a sequence of
interchanges of rows and columns. Exercise 12.2 asks readers to prove these
properties of Kronecker products. For an exceedingly detailed discussion of
the properties of Kronecker products, see Magnus and Neudecker (1988).
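

The three identities in (12.08), and the remark about interchanging rows and columns, can be checked numerically. The following is only an illustrative sketch: the matrices `A`, `B`, `C`, and `D` are arbitrary values chosen for the example, not taken from the text, and NumPy's `np.kron` is used to form the Kronecker products.

```python
import numpy as np

# Arbitrary nonsingular square matrices (illustrative values only).
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]])
C = np.array([[0.0, 1.0], [4.0, 2.0]])
D = np.array([[1.0, 0.0, 3.0], [2.0, 1.0, 0.0], [0.0, 1.0, 1.0]])

AB = np.kron(A, B)

# The three results in (12.08).
transpose_ok = np.allclose(AB.T, np.kron(A.T, B.T))
product_ok = np.allclose(AB @ np.kron(C, D), np.kron(A @ C, B @ D))
inverse_ok = np.allclose(np.linalg.inv(AB),
                         np.kron(np.linalg.inv(A), np.linalg.inv(B)))

# B (x) A is a reordering of the rows and columns of A (x) B.
# Row i*r + k of A (x) B corresponds to row k*p + i of B (x) A.
p, r = A.shape[0], B.shape[0]
perm = np.arange(p * r).reshape(p, r).T.ravel()
permuted_ok = np.allclose(AB[np.ix_(perm, perm)], np.kron(B, A))

# Without the reordering, the two products differ.
commutes = np.allclose(AB, np.kron(B, A))
```

Here `perm` applies the same reordering to rows and columns because both matrices are square; for rectangular matrices the row and column permutations would differ.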


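As we have seen, the system of equations defined by (12.01) and (12.02) is
equivalent to the single equation (12.04), with $gn$ observations and error terms
that have covariance matrix $\Sigma_\bullet$. Therefore, when the matrix $\Sigma$ is known, we
can obtain consistent and efficient estimates of the $\beta_i$, or equivalently of $\beta_\bullet$,
simply by using the classical GLS estimator (7.04). We find that


$$\begin{aligned} \hat\beta_\bullet^{\rm GLS} &= (X_\bullet^\top\Sigma_\bullet^{-1}X_\bullet)^{-1}X_\bullet^\top\Sigma_\bullet^{-1}y_\bullet \\ &= \bigl(X_\bullet^\top(\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1}X_\bullet^\top(\Sigma^{-1}\otimes I_n)y_\bullet, \end{aligned} \tag{12.09}$$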


where, to obtain the second line, we have used the last of equations (12.08).
This GLS estimator is sometimes called the SUR estimator. From the result
(7.05) for GLS estimation, its covariance matrix is


$$\mathrm{Var}(\hat\beta_\bullet^{\rm GLS}) = \bigl(X_\bullet^\top(\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1}. \tag{12.10}$$


Since $\Sigma$ is assumed to be known, we can use this covariance matrix directly,
because there are no variance parameters to estimate.


As in the univariate case, there is a criterion function associated with the GLS
estimator (7.04). This criterion function is simply expression (7.06) adapted
to the model (12.04), namely,


$$(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet). \tag{12.11}$$


The first-order conditions for the minimization of (12.11) with respect to $\beta_\bullet$
can be written as


$$X_\bullet^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\hat\beta_\bullet^{\rm GLS}) = 0. \tag{12.12}$$


These moment conditions, which are analogous to conditions (7.07) for the
case of univariate GLS estimation, can be interpreted as a set of estimating
equations that define the GLS estimator (12.09).


In the slightly less unrealistic situation in which $\Sigma$ is assumed to be known
only up to a scalar factor, so that $\Sigma = \sigma^2\Delta$, the form of (12.09) would be
unchanged, but with $\Delta$ replacing $\Sigma$, and the covariance matrix (12.10) would
become


$$\mathrm{Var}(\hat\beta_\bullet^{\rm GLS}) = \sigma^2\bigl(X_\bullet^\top(\Delta^{-1}\otimes I_n)X_\bullet\bigr)^{-1}.$$


In practice, to estimate $\mathrm{Var}(\hat\beta_\bullet^{\rm GLS})$ we replace $\sigma^2$ by something that estimates
it consistently. Two natural estimators are


$$\hat\sigma^2 = \frac{1}{gn}\,\hat u_\bullet^\top(\Delta^{-1}\otimes I_n)\hat u_\bullet, \quad\text{and}\quad s^2 = \frac{1}{gn-k}\,\hat u_\bullet^\top(\Delta^{-1}\otimes I_n)\hat u_\bullet,$$


where $\hat u_\bullet$ denotes the vector of error terms from GLS estimation of (12.04).
The first estimator is analogous to the ML estimator of $\sigma^2$ in the linear
regression model, and the second one is analogous to the OLS estimator.


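At this point, a word of warning is in order. Although the GLS estimator
(12.09) has quite a simple form, it can be expensive to compute when $gn$
is large. In consequence, no sensible regression package would actually use
this formula. We can proceed more efficiently by working directly with the
estimating equations (12.12). Writing them out explicitly, we obtain


$$\begin{aligned} &X_\bullet^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\hat\beta_\bullet^{\rm GLS}) \\ &\quad= \begin{bmatrix} X_1^\top & \cdots & O \\ \vdots & \ddots & \vdots \\ O & \cdots & X_g^\top \end{bmatrix} \begin{bmatrix} \sigma^{11}I_n & \cdots & \sigma^{1g}I_n \\ \vdots & \ddots & \vdots \\ \sigma^{g1}I_n & \cdots & \sigma^{gg}I_n \end{bmatrix} \begin{bmatrix} y_1 - X_1\hat\beta_1^{\rm GLS} \\ \vdots \\ y_g - X_g\hat\beta_g^{\rm GLS} \end{bmatrix} \\ &\quad= \begin{bmatrix} \sigma^{11}X_1^\top & \cdots & \sigma^{1g}X_1^\top \\ \vdots & \ddots & \vdots \\ \sigma^{g1}X_g^\top & \cdots & \sigma^{gg}X_g^\top \end{bmatrix} \begin{bmatrix} y_1 - X_1\hat\beta_1^{\rm GLS} \\ \vdots \\ y_g - X_g\hat\beta_g^{\rm GLS} \end{bmatrix} = 0, \end{aligned} \tag{12.13}$$


where $\sigma^{ij}$ denotes the $ij$th element of the matrix $\Sigma^{-1}$. By solving the $k$
equations (12.13) for the $\hat\beta_i$, we find easily enough (see Exercise 12.5) that


$$\hat\beta_\bullet^{\rm GLS} = \begin{bmatrix} \sigma^{11}X_1^\top X_1 & \cdots & \sigma^{1g}X_1^\top X_g \\ \vdots & \ddots & \vdots \\ \sigma^{g1}X_g^\top X_1 & \cdots & \sigma^{gg}X_g^\top X_g \end{bmatrix}^{-1} \begin{bmatrix} \sum_{j=1}^g \sigma^{1j}X_1^\top y_j \\ \vdots \\ \sum_{j=1}^g \sigma^{gj}X_g^\top y_j \end{bmatrix}. \tag{12.14}$$

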
Although this expression may look more complicated than (12.09), it is much
less costly to compute. Recall that we grouped all the linearly independent
explanatory variables of the entire SUR system into the $n \times l$ matrix $X$. By
computing the matrix product $X^\top X$, we may obtain all the blocks of the form
$X_i^\top X_j$ merely by selecting the appropriate rows and corresponding columns
of this product. Similarly, if we form the $n \times g$ matrix $Y$ by stacking the $g$
dependent variables horizontally rather than vertically, so that


$$Y \equiv [\,y_1 \;\; y_2 \;\; \cdots \;\; y_g\,],$$


then all the vectors of the form $X_i^\top y_j$ needed on the right-hand side of (12.14)
can be extracted as a selection of the elements of the $j$th column of the
product $X^\top Y$.
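

To make the computational point concrete, here is a minimal sketch that computes the GLS estimator twice: once literally from (12.09), forming the $gn \times gn$ weighting matrix, and once from (12.14), using only selections from $X^\top X$ and $X^\top Y$. The data, the seed, and the names `cols`, `beta_kron`, and `beta_block` are assumptions made for the illustration; `cols[i]` simply records which columns of $X$ make up $X_i$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, g = 50, 2

# Hypothetical data: X holds all l = 3 distinct regressors, and cols[i]
# lists the columns of X that appear in X_i.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
cols = [[0, 1], [0, 2]]
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])        # treated as known
U = rng.multivariate_normal(np.zeros(g), Sigma, size=n)
Y = np.column_stack([X[:, cols[0]] @ np.array([1.0, 2.0]) + U[:, 0],
                     X[:, cols[1]] @ np.array([-1.0, 0.5]) + U[:, 1]])
S_inv = np.linalg.inv(Sigma)

# Literal version of (12.09): build the gn x k block-diagonal matrix
# and the gn x gn weighting matrix.
k_i = [len(c) for c in cols]
X_stack = np.zeros((g * n, sum(k_i)))
col = 0
for i, c in enumerate(cols):
    X_stack[i * n:(i + 1) * n, col:col + k_i[i]] = X[:, c]
    col += k_i[i]
y_stack = Y.T.ravel()                              # y_1 stacked above y_2
W = np.kron(S_inv, np.eye(n))
beta_kron = np.linalg.solve(X_stack.T @ W @ X_stack, X_stack.T @ W @ y_stack)

# Version based on (12.14): every block is a selection from X'X or X'Y.
XtX, XtY = X.T @ X, X.T @ Y
lhs = np.block([[S_inv[i, j] * XtX[np.ix_(cols[i], cols[j])] for j in range(g)]
                for i in range(g)])
rhs = np.concatenate([sum(S_inv[i, j] * XtY[cols[i], j] for j in range(g))
                      for i in range(g)])
beta_block = np.linalg.solve(lhs, rhs)
cov_block = np.linalg.inv(lhs)     # inverse of the block matrix in (12.14)
```

The second version never forms a matrix with more than $k$ rows, which is why it scales so much better as $n$ grows.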


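The covariance matrix (12.10) can also be expressed in a form more suitable
for computation. By a calculation just like the one that gave us (12.13), we
see that (12.10) can be expressed as


$$\mathrm{Var}(\hat\beta_\bullet^{\rm GLS}) = \begin{bmatrix} \sigma^{11}X_1^\top X_1 & \cdots & \sigma^{1g}X_1^\top X_g \\ \vdots & \ddots & \vdots \\ \sigma^{g1}X_g^\top X_1 & \cdots & \sigma^{gg}X_g^\top X_g \end{bmatrix}^{-1}. \tag{12.15}$$


Again, all the blocks here are selections of rows and columns of $X^\top X$.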


For the purposes of further analysis, the estimating equations (12.13) can be
expressed more concisely by writing out the $i$th row as follows:


$$\sum_{j=1}^g \sigma^{ij}X_i^\top(y_j - X_j\hat\beta_j^{\rm GLS}) = 0. \tag{12.16}$$


The matrix equation (12.13) is clearly equivalent to the set of equations (12.16)
for $i = 1,\ldots,g$.


Feasible GLS Estimation 


In practice, the contemporaneous covariance matrix $\Sigma$ is very rarely known.
When it is not, the easiest approach is simply to replace $\Sigma$ in (12.09) by a
matrix that estimates it consistently. In principle, there are many ways to do
so, but the most natural approach is to base the estimate on OLS residuals.
This leads to the following feasible GLS procedure, which is probably the
most commonly used procedure for estimating linear SUR systems.


The first step is to estimate each of the equations by OLS. This yields consistent,
but inefficient, estimates of the $\beta_i$, along with $g$ vectors of least squares
residuals $\hat u_i$. The natural estimator of $\Sigma$ is then


$$\hat\Sigma \equiv \frac{1}{n}\hat U^\top\hat U, \tag{12.17}$$


where $\hat U$ is an $n \times g$ matrix with $i$th column $\hat u_i$. By construction, the matrix
$\hat\Sigma$ is symmetric, and it will be positive definite whenever the columns of $\hat U$
are not linearly dependent. The feasible GLS estimator is given by


$$\hat\beta_\bullet^{\rm F} = \bigl(X_\bullet^\top(\hat\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1}X_\bullet^\top(\hat\Sigma^{-1}\otimes I_n)y_\bullet, \tag{12.18}$$


and the natural way to estimate its covariance matrix is


$$\widehat{\mathrm{Var}}(\hat\beta_\bullet^{\rm F}) = \bigl(X_\bullet^\top(\hat\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1}. \tag{12.19}$$


As expected, the feasible GLS estimator (12.18) and the estimated covariance
matrix (12.19) have precisely the same forms as their full GLS counterparts,
which are (12.09) and (12.10), respectively.
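

The two-step procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the simulated data, the parameter values, and the helper names `ols_residuals` and `sur_gls` are all hypothetical, and the GLS step uses the block form of (12.14) and (12.15) with $\hat\Sigma$ in place of $\Sigma$.

```python
import numpy as np

def ols_residuals(Y, X, cols):
    """Equation-by-equation OLS; returns the n x g matrix of residuals."""
    resid = np.empty_like(Y)
    for i, c in enumerate(cols):
        Xi = X[:, c]
        b = np.linalg.lstsq(Xi, Y[:, i], rcond=None)[0]
        resid[:, i] = Y[:, i] - Xi @ b
    return resid

def sur_gls(Y, X, cols, Sigma):
    """GLS for a given Sigma, using the block forms (12.14) and (12.15)."""
    g = Y.shape[1]
    S_inv = np.linalg.inv(Sigma)
    XtX, XtY = X.T @ X, X.T @ Y
    lhs = np.block([[S_inv[i, j] * XtX[np.ix_(cols[i], cols[j])]
                     for j in range(g)] for i in range(g)])
    rhs = np.concatenate([sum(S_inv[i, j] * XtY[cols[i], j] for j in range(g))
                          for i in range(g)])
    cov = np.linalg.inv(lhs)
    return cov @ rhs, cov

# Simulated two-equation system (all values are illustrative assumptions).
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
cols = [[0, 1], [0, 2]]              # columns of X appearing in X_1 and X_2
Sigma_true = np.array([[1.0, 0.8], [0.8, 1.5]])
U = rng.multivariate_normal(np.zeros(2), Sigma_true, size=n)
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = np.column_stack([X[:, cols[0]] @ beta_true[:2] + U[:, 0],
                     X[:, cols[1]] @ beta_true[2:] + U[:, 1]])

U_hat = ols_residuals(Y, X, cols)                        # step 1: OLS residuals
Sigma_hat = U_hat.T @ U_hat / n                          # (12.17)
beta_fgls, cov_fgls = sur_gls(Y, X, cols, Sigma_hat)     # (12.18) and (12.19)
std_errors = np.sqrt(np.diag(cov_fgls))
```

Dividing by `n` in `Sigma_hat` follows (12.17); any other consistent estimate of $\Sigma$ could be passed to `sur_gls` in the same way.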


Because we divided by $n$ in (12.17), $\hat\Sigma$ must be a biased estimator of $\Sigma$.
If $k_i$ is the same for all $i$, then it would seem natural to divide by $n - k_i$
instead, and this would at least produce unbiased estimates of the diagonal
elements. But we cannot do that when $k_i$ is not the same in all equations.
If we were to divide different elements of $\hat U^\top\hat U$ by different quantities, the
resulting estimate of $\Sigma$ would not necessarily be positive definite.


Replacing $\Sigma$ with an estimator $\hat\Sigma$ based on OLS estimates, or indeed any
other estimator, inevitably degrades the finite-sample properties of the GLS
estimator. In general, we would expect the performance of the feasible GLS
estimator, relative to that of the GLS estimator, to be especially poor when
the sample size is small and the number of equations is large. Under the
strong assumption that all the regressors are exogenous, exact inference based
on the normal and $\chi^2$ distributions is possible whenever the error terms are
normally distributed and $\Sigma$ is known, but this is not the case when $\Sigma$ has
to be estimated. Not surprisingly, there is evidence that bootstrapping can
yield more reliable inferences than using asymptotic theory for SUR models;
see, among others, Rilstone and Veall (1996) and Fiebig and Kim (2000).


Cases in which OLS Estimation is Efficient 


The SUR estimator (12.09) is efficient under the assumptions we have made, 
because it is just a special case of the GLS estimator (7.04), the efficiency of 
which was proved in Section 7.2. In contrast, the OLS estimator (12.06) is, in 
general, inefficient. The reason is that, unless the matrix $\Sigma$ is proportional
to an identity matrix, the error terms of equation (12.04) are not IID. Never- 
theless, there are two important special cases in which the OLS estimator is 
numerically identical to the SUR estimator, and therefore just as efficient. 


In the first case, the matrix $\Sigma$ is diagonal, although the diagonal elements
need not be the same. This implies that the error terms of equation (12.04)
are heteroskedastic but serially independent. It might seem that this
heteroskedasticity would cause inefficiency, but that turns out not to be the case.
If $\Sigma$ is diagonal, then so is $\Sigma^{-1}$, which means that $\sigma^{ij} = 0$ for $i \neq j$. In that
case, the estimating equations (12.16) simplify to


$$\sigma^{ii}X_i^\top(y_i - X_i\hat\beta_i^{\rm GLS}) = 0, \quad i = 1,\ldots,g.$$


The factors $\sigma^{ii}$, which must be nonzero, have no influence on the solutions
to the above equations, which are therefore the same as the solutions to the
$g$ independent sets of equations $X_i^\top(y_i - X_i\hat\beta_i) = 0$ which define the
equation-by-equation OLS estimator (12.06). Thus, if the error terms are uncorrelated
across equations, the GLS and OLS estimators are numerically identical. The
“seemingly” unrelated equations are indeed unrelated in this case.


In the second case, the matrix $\Sigma$ is not diagonal, but all the regressor matrices
$X_1$ through $X_g$ are the same, and are thus all equal to the matrix $X$ that
contains all the explanatory variables. Thus the estimating equations (12.16)
become


$$\sum_{j=1}^g \sigma^{ij}X^\top(y_j - X\hat\beta_j^{\rm GLS}) = 0, \quad i = 1,\ldots,g.$$


If we multiply these equations by $\sigma_{mi}$, for any $m$ between 1 and $g$, and sum
over $i$ from 1 to $g$, we obtain


$$\sum_{i=1}^g\sum_{j=1}^g \sigma_{mi}\sigma^{ij}X^\top(y_j - X\hat\beta_j^{\rm GLS}) = 0. \tag{12.20}$$


Since the $\sigma_{mi}$ are elements of $\Sigma$ and the $\sigma^{ij}$ are elements of its inverse, it
follows that the sum $\sum_{i=1}^g \sigma_{mi}\sigma^{ij}$ is equal to $\delta_{mj}$, the Kronecker delta, which
is equal to 1 if $m = j$ and to 0 otherwise. Thus, for each $m = 1,\ldots,g$, there is
just one nonzero term on the left-hand side of (12.20) after the sum over $i$ is
performed, namely, that for which $j = m$. In consequence, equations (12.20)
collapse to


$$X^\top(y_m - X\hat\beta_m^{\rm GLS}) = 0.$$


Since these are the estimating equations that define the OLS estimator of the
$m$th equation, we conclude that $\hat\beta_m^{\rm GLS} = \hat\beta_m^{\rm OLS}$ for all $m$.
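

Both special cases are numerical identities, so they can be checked on arbitrary data. The sketch below is illustrative only: the random data, the two covariance matrices, and the helper names `stacked_gls` and `equation_by_equation_ols` are assumptions introduced for the example, and the GLS estimator is computed literally from (12.09) because the system is small.

```python
import numpy as np

def stacked_gls(X_list, Y, Sigma):
    """GLS estimator (12.09), computed with the gn x gn weighting matrix."""
    n, g = Y.shape
    k_i = [Xi.shape[1] for Xi in X_list]
    X_stack = np.zeros((g * n, sum(k_i)))
    col = 0
    for i, Xi in enumerate(X_list):
        X_stack[i * n:(i + 1) * n, col:col + k_i[i]] = Xi
        col += k_i[i]
    W = np.kron(np.linalg.inv(Sigma), np.eye(n))
    return np.linalg.solve(X_stack.T @ W @ X_stack,
                           X_stack.T @ W @ Y.T.ravel())

def equation_by_equation_ols(X_list, Y):
    """OLS applied separately to each equation, stacked as in (12.06)."""
    return np.concatenate([np.linalg.lstsq(Xi, Y[:, i], rcond=None)[0]
                           for i, Xi in enumerate(X_list)])

rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = rng.normal(size=(n, 3))          # arbitrary data: the identities are numerical

Sigma_full = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, -0.3], [0.3, -0.3, 1.5]])
Sigma_diag = np.diag([2.0, 1.0, 1.5])
different_X = [X[:, [0, 1]], X[:, [0, 2]], X]
same_X = [X, X, X]

# Case 1: diagonal Sigma, different regressors in each equation.
case1 = np.allclose(stacked_gls(different_X, Y, Sigma_diag),
                    equation_by_equation_ols(different_X, Y))
# Case 2: non-diagonal Sigma, identical regressors in every equation.
case2 = np.allclose(stacked_gls(same_X, Y, Sigma_full),
                    equation_by_equation_ols(same_X, Y))
# General case: non-diagonal Sigma and different regressors.
general = np.allclose(stacked_gls(different_X, Y, Sigma_full),
                      equation_by_equation_ols(different_X, Y))
```

In this setup `case1` and `case2` hold up to rounding error, while `general` does not, which is the situation in which GLS can improve on OLS.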


A GMM Interpretation 


The above proof is straightforward enough, but it is not particularly intuitive. 
A much more intuitive way to see why the SUR estimator is identical to the 
OLS estimator in this special case is to interpret all of the estimators we have 
been studying as GMM estimators. This interpretation also provides a number 
of other insights and suggests a simple way of testing the overidentifying 
restrictions that are implicitly present whenever the SUR and OLS estimators 
are not identical. 


Consider the $gl$ theoretical moment conditions


$$E\bigl(X^\top(y_i - X_i\beta_i)\bigr) = 0, \quad \text{for } i = 1,\ldots,g, \tag{12.21}$$


which state that every regressor, whether or not it appears in a particular
equation, must be uncorrelated with the error terms for every equation. In
the general case, these moment conditions are used to estimate $k$ parameters,
where $k = \sum_{i=1}^g k_i$. Since, in general, $k < gl$, we have more moment
conditions than parameters, and we can choose a set of linear combinations of the
conditions that minimizes the covariance matrix of the estimator. As is clear
from the estimating equations (12.12), that is precisely what the SUR
estimator (12.09) does. Although these estimating equations were derived from the
principles of GLS, they are evidently the empirical counterpart of the optimal
moment conditions (9.18) given in Section 9.2 in the context of GMM for the
case of a known covariance matrix and exogenous regressors. Therefore, the 
SUR estimator is, in general, an efficient GMM estimator. 


In the special case in which every equation has the same regressors, the number
of parameters is also equal to $gl$. Therefore, we have just as many parameters
as moment conditions, and the empirical counterpart of (12.21) collapses to


$$X^\top(y_i - X\hat\beta_i) = 0, \quad \text{for } i = 1,\ldots,g,$$


which are just the moment conditions that define the equation-by-equation
OLS estimator. Each of these $g$ sets of equations can be solved for the $l$
parameters in $\beta_i$, and the unique solution is $\hat\beta_i^{\rm OLS}$.


We can now see that the two cases in which OLS is efficient arise for two quite 
different reasons. Clearly, no efficiency gain relative to OLS is possible unless 
there are more moment conditions than the OLS estimator utilizes. In other 
words, there can be no efficiency gain unless gl > k. In the second case, OLS 
is efficient because gl = k. In the first case, there are in general additional 
moment conditions, but, because there is no contemporaneous correlation, 
they are not informative about the model parameters. 


We now derive the efficient GMM estimator from first principles and show
that it is identical to the SUR estimator. We start from the set of $gl$ sample
moments


$$(I_g \otimes X^\top)(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet). \tag{12.22}$$


These provide the sample analog, for the linear SUR model, of the left-hand
side of the theoretical moment conditions (9.18). The matrix in the middle
is the inverse of the covariance matrix of the stacked vector of error terms.
Using the second result in (12.08), expression (12.22) can be rewritten as


$$(\Sigma^{-1}\otimes X^\top)(y_\bullet - X_\bullet\beta_\bullet). \tag{12.23}$$


The covariance matrix of this $gl$-vector is


$$(\Sigma^{-1}\otimes X^\top)(\Sigma\otimes I_n)(\Sigma^{-1}\otimes X) = \Sigma^{-1}\otimes X^\top X, \tag{12.24}$$


where we have made repeated use of the second result in (12.08). Combining
(12.23) and (12.24) to construct the appropriate quadratic form, we find that
the criterion function for fully efficient GMM estimation is


$$\begin{aligned} &(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes X)\bigl(\Sigma\otimes(X^\top X)^{-1}\bigr)(\Sigma^{-1}\otimes X^\top)(y_\bullet - X_\bullet\beta_\bullet) \\ &\quad= (y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes P_X)(y_\bullet - X_\bullet\beta_\bullet), \end{aligned} \tag{12.25}$$


where, as usual, $P_X$ is the hat matrix, which projects orthogonally on to the
subspace spanned by the columns of $X$.


It is not hard to see that the vector $\hat\beta_\bullet^{\rm GMM}$ which minimizes expression (12.25)
must be identical to $\hat\beta_\bullet^{\rm GLS}$. The first-order conditions may be written as


$$\sum_{j=1}^g \sigma^{ij}X_i^\top P_X(y_j - X_j\hat\beta_j^{\rm GMM}) = 0. \tag{12.26}$$


But since each of the matrices $X_i$ lies in $\mathcal{S}(X)$, it must be the case that
$P_X X_i = X_i$, and so conditions (12.26) are actually identical to conditions
(12.16), which define the GLS estimator.


Since the GLS estimator, and equally the feasible GLS estimator, can be interpreted
as an efficient GMM estimator, it is natural to test the overidentifying
restrictions that these estimators depend on. These are the restrictions that certain
columns of $X$ do not appear in certain equations. The usual Hansen-Sargan
statistic, which is just the minimized value of the criterion function (12.25),
will be asymptotically distributed as $\chi^2(gl - k)$ under the null hypothesis.
As usual, the degrees of freedom for the test is equal to the number of
moment conditions minus the number of estimated parameters. Investigators
should always report the Hansen-Sargan statistic whenever they estimate a
multivariate regression model by feasible GLS.
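

As a rough illustration, the Hansen-Sargan statistic for a feasible GLS fit can be computed by evaluating the criterion function (12.25) at the feasible GLS estimates, with $\hat\Sigma$ from (12.17) in place of $\Sigma$. Everything in the sketch below (the simulated data, the seed, and the names `J`, `J_trace`, and `df`) is an assumption made for the example. The second computation uses the equivalent expression $\mathrm{tr}(\hat\Sigma^{-1}\hat U^\top P_X\hat U)$, which avoids the $gn \times gn$ matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])    # l = 3
cols = [[0, 1], [0, 2]]                  # columns of X used in each equation
g, l = len(cols), X.shape[1]
k = sum(len(c) for c in cols)
U = rng.multivariate_normal(np.zeros(g), [[1.0, 0.7], [0.7, 1.0]], size=n)
Y = np.column_stack([X[:, cols[0]] @ np.array([1.0, 2.0]) + U[:, 0],
                     X[:, cols[1]] @ np.array([0.5, -1.0]) + U[:, 1]])

# Stacked system (12.04).
X_stack = np.zeros((g * n, k))
col = 0
for i, c in enumerate(cols):
    X_stack[i * n:(i + 1) * n, col:col + len(c)] = X[:, c]
    col += len(c)
y_stack = Y.T.ravel()

# Feasible GLS: OLS residuals, then (12.17) and (12.18).
b_ols = np.linalg.lstsq(X_stack, y_stack, rcond=None)[0]
U_ols = (y_stack - X_stack @ b_ols).reshape(g, n).T
Sigma_hat = U_ols.T @ U_ols / n
S_inv = np.linalg.inv(Sigma_hat)
W = np.kron(S_inv, np.eye(n))
b_fgls = np.linalg.solve(X_stack.T @ W @ X_stack, X_stack.T @ W @ y_stack)

# Criterion function (12.25) evaluated at the feasible GLS estimates.
u = y_stack - X_stack @ b_fgls
P_X = X @ np.linalg.solve(X.T @ X, X.T)
J = u @ np.kron(S_inv, P_X) @ u
df = g * l - k                           # moment conditions minus parameters

# Equivalent form that avoids the gn x gn matrix.
U_f = u.reshape(g, n).T
J_trace = np.trace(S_inv @ U_f.T @ P_X @ U_f)
```

Under the null hypothesis, `J` would be compared with the $\chi^2$ distribution with `df` degrees of freedom; the sketch stops short of computing a P value so that it depends only on NumPy.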


Since feasible GLS is really a feasible efficient GMM estimator, we might 
prefer to use the continuously updated GMM estimator, which was introduced 
in Section 9.2. Although the latter estimator is asymptotically equivalent 
to the one-step feasible GMM estimator, it may have better properties in 
finite samples. In this case, the continuously updated estimator is simply 
iterated feasible GLS, and it works as follows. After obtaining the feasible GLS 
estimator (12.18), we use it to recompute the residuals. These are then used 
in the formula (12.17) to obtain an updated estimate of the contemporaneous 
covariance matrix $\Sigma$, which is then plugged back into the formula (12.18) to
obtain an updated estimate of $\beta_\bullet$. This procedure may be repeated as many
times as desired. If the procedure converges, then, as we will see shortly, 
the estimator that results is equal to the ML estimator computed under the 
assumption of normal error terms. 
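

The iteration just described can be sketched as follows. This is a minimal illustration, not a production routine: the simulated data, the convergence tolerance, and the helper names `residual_matrix`, `sur_gls`, and `iterated_fgls` are assumptions made for the example. Starting from $\Sigma = I$ reproduces equation-by-equation OLS, so the first pass through the loop is ordinary feasible GLS.

```python
import numpy as np

def residual_matrix(beta, Y, X, cols):
    """n x g matrix whose i-th column is y_i - X_i beta_i."""
    U = np.empty_like(Y)
    start = 0
    for i, c in enumerate(cols):
        U[:, i] = Y[:, i] - X[:, c] @ beta[start:start + len(c)]
        start += len(c)
    return U

def sur_gls(Y, X, cols, Sigma):
    """GLS estimate for a given Sigma, using the block form (12.14)."""
    g = Y.shape[1]
    S_inv = np.linalg.inv(Sigma)
    XtX, XtY = X.T @ X, X.T @ Y
    lhs = np.block([[S_inv[i, j] * XtX[np.ix_(cols[i], cols[j])]
                     for j in range(g)] for i in range(g)])
    rhs = np.concatenate([sum(S_inv[i, j] * XtY[cols[i], j] for j in range(g))
                          for i in range(g)])
    return np.linalg.solve(lhs, rhs)

def iterated_fgls(Y, X, cols, tol=1e-10, max_iter=200):
    n, g = Y.shape
    beta = sur_gls(Y, X, cols, np.eye(g))     # Sigma = I gives OLS
    for iteration in range(1, max_iter + 1):
        U = residual_matrix(beta, Y, X, cols)
        Sigma = U.T @ U / n                   # update Sigma as in (12.17)
        beta_new = sur_gls(Y, X, cols, Sigma) # update beta as in (12.18)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, Sigma, iteration, converged

# Simulated data (illustrative assumptions only).
rng = np.random.default_rng(11)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
cols = [[0, 1], [0, 2]]
E = rng.multivariate_normal(np.zeros(2), [[1.0, 0.6], [0.6, 2.0]], size=n)
Y = np.column_stack([X[:, cols[0]] @ np.array([1.0, 2.0]) + E[:, 0],
                     X[:, cols[1]] @ np.array([-1.0, 0.5]) + E[:, 1]])

beta_hat, Sigma_hat, n_iter, converged = iterated_fgls(Y, X, cols)
U_final = residual_matrix(beta_hat, Y, X, cols)
```

At convergence, `Sigma_hat` agrees with `U_final.T @ U_final / n` up to the tolerance, so the pair of estimates is a fixed point of the two updating formulas.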


Determinants of Square Matrices 


The most popular alternative to feasible GLS estimation is maximum like- 
lihood estimation under the assumption that the error terms are normally 
distributed. We will discuss this estimation method in the next subsection. 
However, in order to develop the theory of ML estimation for systems of 
equations, we must first say a few words about determinants. 


A $p \times p$ square matrix $A$ defines a mapping from Euclidean $p$-dimensional
space, $E^p$, into itself, by which a vector $x \in E^p$ is mapped into the $p$-vector
$Ax$. The determinant of $A$ is a scalar quantity which measures the extent to
which this mapping expands or contracts $p$-dimensional volumes in $E^p$.


[Figure 12.1 Determinants in two dimensions. Panel (a): the parallelogram defined by $a_1$ and $a_2$. Panel (b): rectangle of equal area formed with $a_1$ and $M_1a_2$.]


Consider a simple example in $E^2$. Volume in 2-dimensional space is just area.
The simplest area to consider is the unit square, which can be defined as the
parallelogram defined by the two unit basis vectors $e_1$ and $e_2$, where $e_i$ has
only one nonzero component, in position $i$. The area of the unit square is, by
definition, 1. The image of the unit square under the mapping defined by a
$2 \times 2$ matrix $A$ is the parallelogram defined by the two columns of the matrix


$$A[\,e_1 \;\; e_2\,] = AI = A = [\,a_1 \;\; a_2\,],$$


where $a_1$ and $a_2$ are the two columns of $A$. The area of a parallelogram in
Euclidean geometry is given by base times height, where the length of either
one of the two defining vectors can be taken as the base, and the height is then
the perpendicular distance between the two parallel sides that correspond to
this choice of base. This is illustrated in Figure 12.1.


If we choose $a_1$ as the base, then, as we can see from the figure, the height is
the length of the vector $M_1a_2$, where $M_1$ is the orthogonal projection on to
the orthogonal complement of $a_1$. Thus the area of the parallelogram defined
by $a_1$ and $a_2$ is $\|a_1\|\,\|M_1a_2\|$. By use of Pythagoras' Theorem and a little
algebra (see Exercise 12.6), it can be seen that


$$\|a_1\|\,\|M_1a_2\| = |a_{11}a_{22} - a_{12}a_{21}|, \tag{12.27}$$


where $a_{ij}$ is the $ij$th element of $A$. This quantity is the absolute value of
the determinant of $A$, which we write as $|\det A|$. The determinant itself,
which is defined as $a_{11}a_{22} - a_{12}a_{21}$, can be of either sign. Its signed value
can be written as “det A”, but it is more commonly, and perhaps somewhat
confusingly, written as $|A|$.
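

Result (12.27) is easy to check for a particular matrix. The sketch below uses only the standard library; the vectors `a1` and `a2` and the function names are illustrative choices, and the projection $M_1a_2$ is computed as $a_2 - a_1(a_1^\top a_2)/(a_1^\top a_1)$.

```python
import math

def det2(a11, a12, a21, a22):
    """Signed determinant of a 2 x 2 matrix."""
    return a11 * a22 - a12 * a21

def parallelogram_area(a1, a2):
    """Base times height: ||a1|| * ||M1 a2||, where M1 projects off a1."""
    dot = a1[0] * a2[0] + a1[1] * a2[1]
    norm_sq = a1[0] ** 2 + a1[1] ** 2
    # M1 a2 = a2 - a1 (a1'a2) / (a1'a1)
    m1a2 = (a2[0] - a1[0] * dot / norm_sq, a2[1] - a1[1] * dot / norm_sq)
    return math.sqrt(norm_sq) * math.hypot(m1a2[0], m1a2[1])

# Columns of A = [[3, 1], [1, 2]].
a1, a2 = (3.0, 1.0), (1.0, 2.0)
area = parallelogram_area(a1, a2)
det = det2(a1[0], a2[0], a1[1], a2[1])

# Interchanging the columns flips the sign of the determinant but not the area.
area_swapped = parallelogram_area(a2, a1)
det_swapped = det2(a2[0], a1[0], a2[1], a1[1])
```

For these values the determinant is 5, and the area computed as base times height matches its absolute value whichever column is taken as the base.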


Algebraic expressions for determinants of square matrices of dimension higher 
than 2 can be found easily enough, but we will have no need of them. We 
will, however, need to make use of some of the properties of determinants. 
The principal properties that will matter to us are as follows. 


• The determinant of the transpose of a matrix is equal to the determinant
of the matrix itself. That is, $|A^\top| = |A|$.


• The determinant of a triangular matrix is the product of its diagonal
elements.


• Since a diagonal matrix can be regarded as a special triangular matrix,
its determinant is also the product of its diagonal elements.


• Since an identity matrix is a diagonal matrix with all diagonal elements
equal to unity, the determinant of an identity matrix is 1.


• If a matrix can be partitioned so as to be block-diagonal, then its
determinant is the product of the determinants of the diagonal blocks.


• Interchanging two rows, or two columns, of a matrix leaves the absolute
value of the determinant unchanged but changes its sign.


• The determinant of the product of two square matrices of the same
dimensions is the product of their determinants, from which it follows that
the determinant of $A^{-1}$ is the reciprocal of the determinant of $A$.


• If a matrix can be inverted, its determinant must be nonzero. Conversely,
if a matrix is singular, its determinant is 0.


• The derivative of $\log|A|$ with respect to the $ij$th element $a_{ij}$ of $A$ is the
$ji$th element of $A^{-1}$.
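

Several of the properties listed above can be verified numerically for particular matrices. The following sketch is illustrative only: the matrices `A` and `B` are arbitrary choices with positive determinants, and the derivative of $\log|A|$ is approximated by central finite differences with an assumed step size `h`.

```python
import numpy as np

det = np.linalg.det

# Arbitrary nonsymmetric matrices with positive determinants.
A = np.array([[4.0, 1.0, 0.5], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]])

transpose_ok = np.isclose(det(A.T), det(A))
product_ok = np.isclose(det(A @ B), det(A) * det(B))
inverse_ok = np.isclose(det(np.linalg.inv(A)), 1.0 / det(A))
swap_ok = np.isclose(det(A[[1, 0, 2], :]), -det(A))      # swap rows 1 and 2
triangular_ok = np.isclose(det(np.triu(A)), np.prod(np.diag(A)))

# Central finite differences for the derivative of log|A| with respect to a_ij.
h = 1e-6
grad = np.empty_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = h
        grad[i, j] = (np.log(det(A + E)) - np.log(det(A - E))) / (2 * h)

# The (i, j) derivative should match the (j, i) element of the inverse.
derivative_ok = np.allclose(grad, np.linalg.inv(A).T, atol=1e-6)
```

Because `A` is not symmetric, comparing `grad` with the transpose of the inverse rather than the inverse itself is what distinguishes the $ji$th element from the $ij$th.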


Maximum Likelihood Estimation 


If we assume that the error terms of an SUR system are normally distributed,
the system can be estimated by maximum likelihood. The model to be
estimated can be written as


$$y_\bullet = X_\bullet\beta_\bullet + u_\bullet, \qquad u_\bullet \sim N(0, \Sigma\otimes I_n). \tag{12.28}$$


The loglikelihood function for this model is the logarithm of the joint density
of the components of the vector $y_\bullet$. In order to derive that density, we must
start with the density of the vector $u_\bullet$.


Up to this point, we have not actually written down the density of a random 
vector that follows the multivariate normal distribution. We will do so in 
a moment. But first, we state a more fundamental result, which extends 
the result (10.92) that was proved in Section 10.8 for univariate densities of 
transformations of variables to the case of multivariate densities. 


Let $z$ be a random $m$-vector with known density $f_z(z)$, and let $x$ be another
random $m$-vector such that $z = h(x)$, where the deterministic function $h(\cdot)$
is a one to one mapping of the support of the random vector $x$, which is a
subset of $\mathbb{R}^m$, into the support of $z$. Then the multivariate analog of the result
(10.92) is


$$f_x(x) = f_z(h(x))\,|\det J(x)|, \tag{12.29}$$


where $J(x) \equiv \partial h(x)/\partial x$ is the Jacobian matrix of the transformation, that
is, the $m \times m$ matrix containing the derivatives of the components of $h(x)$
with respect to those of $x$.


Using (12.29), it is not difficult to show that, if the $m \times 1$ vector $z$ follows the
multivariate normal distribution with mean vector 0 and covariance matrix $\Omega$,
then its density is equal to


$$(2\pi)^{-m/2}\,|\Omega|^{-1/2}\exp\bigl(-\tfrac{1}{2}z^\top\Omega^{-1}z\bigr). \tag{12.30}$$


Readers are asked to prove a slightly more general result in Exercise 12.8.


For the system (12.28), the function $h(\cdot)$ that gives $u_\bullet$ as a function of $y_\bullet$ is
the right-hand side of the equation


$$u_\bullet = y_\bullet - X_\bullet\beta_\bullet. \tag{12.31}$$


Thus we see that, if there are no lagged dependent variables in the matrix $X_\bullet$,
then the Jacobian of the transformation is just the identity matrix, of which
the determinant is 1.


The Jacobian will, in general, be much more complicated if there are lagged
dependent variables, because the elements of $X_\bullet$ will depend on the elements
of $y_\bullet$. However, as readers are invited to check in Exercise 12.10, even though
the Jacobian is, in such a case, not equal to the identity matrix, its
determinant is still 1. Therefore, we can ignore the Jacobian when we compute the
density of $y_\bullet$. When we substitute (12.31) into (12.30), as the result (12.29)
tells us to do, we find that the density of $y_\bullet$ is $(2\pi)^{-gn/2}$ times


$$|\Sigma\otimes I_n|^{-1/2}\exp\bigl(-\tfrac{1}{2}(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet)\bigr). \tag{12.32}$$


Jointly maximizing the logarithm of this function with respect to $\beta_\bullet$ and the
elements of $\Sigma$ gives the ML estimator of the SUR system.


The argument of the exponential function in (12.32) plays the same role for a
multivariate linear regression model as the sum of squares term plays in the
loglikelihood function (10.10) for a linear regression model with IID normal
errors. In fact, it is clear from (12.32) that maximizing the loglikelihood with
respect to $\beta_\bullet$ for a given $\Sigma$ is equivalent to minimizing the function


$$(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet)$$


with respect to $\beta_\bullet$. This expression is just the criterion function (12.11) that
is minimized in order to obtain the GLS estimator (12.09). Therefore, the
ML estimator $\hat\beta_\bullet^{\rm ML}$ must have exactly the same form as (12.09), with the
matrix $\Sigma$ replaced by its ML estimator $\hat\Sigma_{\rm ML}$, which we will derive shortly.


It follows from (12.32) that the loglikelihood function $\ell(\Sigma, \beta_\bullet)$ for the model
(12.28) can be written as


$$-\frac{gn}{2}\log 2\pi - \frac{1}{2}\log|\Sigma\otimes I_n| - \frac{1}{2}(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet).$$


The properties of determinants set out in the previous subsection can be used
to show that the determinant of $\Sigma\otimes I_n$ is $|\Sigma|^n$; see Exercise 12.11. Thus this
loglikelihood function simplifies to


$$-\frac{gn}{2}\log 2\pi - \frac{n}{2}\log|\Sigma| - \frac{1}{2}(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet). \tag{12.33}$$


We have already seen how to maximize the function (12.33) with respect to $\beta_\bullet$
conditional on $\Sigma$. Now we want to maximize it with respect to $\Sigma$.


Maximizing $\ell(\Sigma, \beta_\bullet)$ with respect to $\Sigma$ is of course equivalent to maximizing it
with respect to $\Sigma^{-1}$, and it turns out to be technically simpler to differentiate
with respect to the elements of the latter matrix. Note first that, since the
determinant of the inverse of a matrix is the reciprocal of the determinant
of the matrix itself, we have $-\log|\Sigma| = \log|\Sigma^{-1}|$, so that we can readily
express all of (12.33) in terms of $\Sigma^{-1}$ rather than $\Sigma$.


It is obvious that the derivative of any $p \times q$ matrix $A$ with respect to its $ij$th
element is the $p \times q$ matrix $E_{ij}$, all the elements of which are 0, except for
the $ij$th, which is 1. Recall that we write the $ij$th element of $\Sigma^{-1}$ as $\sigma^{ij}$. We
therefore find that


$$\frac{\partial\Sigma^{-1}}{\partial\sigma^{ij}} = E_{ij}, \tag{12.34}$$


where in this case $E_{ij}$ is a $g \times g$ matrix. We remarked in our earlier discussion
of determinants that the derivative of $\log|A|$ with respect to $a_{ij}$ is the $ji$th
element of $A^{-1}$. Armed with this result and (12.34), we see that the derivative
of the loglikelihood function $\ell(\Sigma, \beta_\bullet)$ with respect to the element $\sigma^{ij}$ is


$$\frac{\partial\ell(\Sigma,\beta_\bullet)}{\partial\sigma^{ij}} = \frac{n}{2}\sigma_{ij} - \frac{1}{2}(y_\bullet - X_\bullet\beta_\bullet)^\top(E_{ij}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet). \tag{12.35}$$


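The Kronecker product $E_{ij}\otimes I_n$ has only one nonzero block containing $I_n$. It
is easy to conclude from this that


$$(y_\bullet - X_\bullet\beta_\bullet)^\top(E_{ij}\otimes I_n)(y_\bullet - X_\bullet\beta_\bullet) = (y_i - X_i\beta_i)^\top(y_j - X_j\beta_j).$$


By equating the partial derivative (12.35) to zero, we find that the ML
estimator $\hat\sigma_{ij}^{\rm ML}$ is


$$\hat\sigma_{ij}^{\rm ML} = \frac{1}{n}(y_i - X_i\hat\beta_i^{\rm ML})^\top(y_j - X_j\hat\beta_j^{\rm ML}).$$


If we define the $n \times g$ matrix $U(\beta_\bullet)$ to have $i$th column $y_i - X_i\beta_i$, then we
can conveniently write the ML estimator of $\Sigma$ as follows:


$$\hat\Sigma_{\rm ML} = \frac{1}{n}U^\top(\hat\beta_\bullet^{\rm ML})\,U(\hat\beta_\bullet^{\rm ML}). \tag{12.36}$$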


This looks like equation (12.17), which defines the covariance matrix used in
feasible GLS estimation. Equations (12.36) and (12.17) have exactly the same
form, but they are based on different matrices of residuals. Equation (12.36)
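and equation (12.09) evaluated at $\hat\Sigma_{\rm ML}$, that is


$$\hat\beta_\bullet^{\rm ML} = \bigl(X_\bullet^\top(\hat\Sigma_{\rm ML}^{-1}\otimes I_n)X_\bullet\bigr)^{-1}X_\bullet^\top(\hat\Sigma_{\rm ML}^{-1}\otimes I_n)y_\bullet, \tag{12.37}$$


together define the ML estimator for the model (12.28).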
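

Two steps of this derivation lend themselves to a quick numerical check: the simplification of the loglikelihood to (12.33), which relies on $|\Sigma\otimes I_n| = |\Sigma|^n$, and the claim that, for given residuals, the loglikelihood is maximized over $\Sigma$ by $U^\top U/n$ as in (12.36). The sketch below is illustrative only; the matrix `U` is a random stand-in for $U(\beta_\bullet)$ at some fixed $\beta_\bullet$, and the function name `loglik` is an assumption.

```python
import numpy as np

def loglik(Sigma, U):
    """Loglikelihood (12.33) for a given Sigma and residual matrix U."""
    n, g = U.shape
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.trace(np.linalg.inv(Sigma) @ U.T @ U)
    return -0.5 * g * n * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

rng = np.random.default_rng(5)
n, g = 30, 2
# Stand-in for the residual matrix U(beta) at some fixed beta.
U = rng.normal(size=(n, g)) @ np.array([[1.0, 0.5], [0.0, 1.2]])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])

# Unsimplified loglikelihood, built from the gn x gn matrix Sigma (x) I_n.
u_stack = U.T.ravel()
_, logdet_big = np.linalg.slogdet(np.kron(Sigma, np.eye(n)))
quad_big = u_stack @ np.kron(np.linalg.inv(Sigma), np.eye(n)) @ u_stack
ll_big = -0.5 * g * n * np.log(2 * np.pi) - 0.5 * logdet_big - 0.5 * quad_big

# log|Sigma (x) I_n| should equal n log|Sigma|.
_, logdet_small = np.linalg.slogdet(Sigma)

# For these residuals, the maximizing Sigma has the form of (12.36).
Sigma_ml = U.T @ U / n
ll_at_ml = loglik(Sigma_ml, U)
ll_elsewhere = loglik(Sigma, U)
ll_perturbed = loglik(Sigma_ml + 0.1 * np.eye(g), U)
```

The quadratic form is computed in `loglik` as $\mathrm{tr}(\Sigma^{-1}U^\top U)$, which equals the stacked expression in (12.33) without requiring any $gn \times gn$ matrix.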


Equations (12.36) and (12.37) are exactly the ones that are used by the
continuously updated GMM estimator to update the estimates of $\Sigma$ and $\beta_\bullet$,
respectively. It follows that, if the continuous updating procedure converges,
it converges to the ML estimator. Consequently, we can estimate the
covariance matrix of $\hat\beta_\bullet^{\rm ML}$ in the same way as for the GLS or GMM estimator, by
the formula


$$\widehat{\mathrm{Var}}(\hat\beta_\bullet^{\rm ML}) = \bigl(X_\bullet^\top(\hat\Sigma_{\rm ML}^{-1}\otimes I_n)X_\bullet\bigr)^{-1}. \tag{12.38}$$


It is also possible to estimate the covariance matrix of the estimated con- 
temporaneous covariance matrix, XML, although this is rarely done. If the 
elements of X are stacked in a vector of length g?, a suitable estimator is 


Var (SA) = 2 5(62") @ E(B"). (12.39) 


Notice that the estimated variance of any diagonal element of $\Sigma$ is just twice
the square of that element, divided by n. This is precisely what is obtained 
for the univariate case in Exercise 10.10. As with that result, the asymptotic 
validity of (12.39) depends critically on the assumption that the error terms 
are multivariate normal. 
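
The statement about diagonal elements can be checked in one line. This is a
brief sketch that assumes the elements of $\Sigma$ are stacked column by column,
so that $\sigma_{ii}$ occupies position $m = (i-1)g + i$ of the stacked vector; the
symbol $m$ is introduced here only for the illustration.

```latex
\Bigl[\tfrac{2}{n}\,\hat{\Sigma} \otimes \hat{\Sigma}\Bigr]_{mm}
  = \frac{2}{n}\,\hat{\sigma}_{ii}\,\hat{\sigma}_{ii}
  = \frac{2\hat{\sigma}_{ii}^{2}}{n},
\qquad m = (i-1)g + i .
% With g = 1 this reduces to the familiar univariate result 2\hat{\sigma}^4 / n.
```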


As we saw in Chapter 10, ML estimators are consistent and asymptotically 
efficient if the underlying model is correctly specified. It may therefore seem 
that the asymptotic efficiency of the ML estimator (12.37) depends critically 
on the multivariate normality assumption. However, the fact that the ML esti- 
mator is identical to the continuously updated efficient GMM estimator means 
that it is in fact efficient in the same sense as the latter. When the errors are 
not normal, the estimator is more properly termed a QMLE (see Section 10.4). 
As such, it is consistent, but not necessarily efficient, under assumptions about 
the error terms that are no stronger than those needed for feasible GLS to be 
consistent. Moreover, if the stronger assumptions made in (12.02) hold, even 
without normality, then the estimator (12.38) of $\operatorname{Var}(\hat{\beta}_{\bullet}^{\mathrm{ML}})$ is asymptotically
valid. If the error terms are not normal, it would be necessary to have infor- 
mation about their actual distribution in order to derive an estimator with a 
smaller asymptotic variance than (12.37). 


It is of considerable theoretical interest to concentrate the loglikelihood
function (12.33) with respect to $\Sigma$. In order to do so, we use the first-order
conditions that led to (12.36) to define $\hat{\Sigma}(\beta_{\bullet})$ as the matrix that maximizes (12.33)
for given $\beta_{\bullet}$. We find that


$$\hat{\Sigma}(\beta_{\bullet}) = \frac{1}{n}\,U^{\top}(\beta_{\bullet})\,U(\beta_{\bullet}).$$

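A calculation of a type that should now be familiar then shows that


$$(y_{\bullet} - X_{\bullet}\beta_{\bullet})^{\top}(\Sigma^{-1} \otimes I_n)(y_{\bullet} - X_{\bullet}\beta_{\bullet}) = \sum_{i=1}^{g}\sum_{j=1}^{g} \sigma^{ij}\,(y_i - X_i\beta_i)^{\top}(y_j - X_j\beta_j). \tag{12.40}$$


When $\sigma^{ij} = \hat{\sigma}^{ij}(\beta_{\bullet})$, which denotes the $ij^{\text{th}}$ element of $\hat{\Sigma}^{-1}(\beta_{\bullet})$, the
right-hand side of equation (12.40) is


$$\sum_{i=1}^{g}\sum_{j=1}^{g} \hat{\sigma}^{ij}(\beta_{\bullet})\bigl(U^{\top}(\beta_{\bullet})U(\beta_{\bullet})\bigr)_{ij} = n\sum_{i=1}^{g}\sum_{j=1}^{g} \hat{\sigma}^{ij}(\beta_{\bullet})\,\hat{\sigma}_{ij}(\beta_{\bullet}) = n\operatorname{Tr}\bigl(\hat{\Sigma}^{-1}(\beta_{\bullet})\hat{\Sigma}(\beta_{\bullet})\bigr) = n\operatorname{Tr}(I_g) = gn,$$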


where we have made use of the trace operator, which sums the diagonal ele- 
ments of a square matrix; see Section 2.6. By substituting this result into 
expression (12.33), we see that the concentrated loglikelihood function can be 
written as 


$$-\frac{gn}{2}\,(\log 2\pi + 1) - \frac{n}{2}\,\log\Bigl|\frac{1}{n}\,U^{\top}(\beta_{\bullet})\,U(\beta_{\bullet})\Bigr|. \tag{12.41}$$


This expression depends on the data only through the determinant of the 
covariance matrix of the residuals. It is the multivariate generalization of the 
concentrated loglikelihood function (10.11) that we obtained in Section 10.2 
in the univariate case. We saw there that the concentrated function depends 
on the data only through the sum of squared residuals. 
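
The relation between (12.33) and (12.41) can be verified numerically. The
sketch below assumes that (12.33) has the standard multivariate normal form
and writes both functions in terms of an n x g residual matrix; the function
names `loglik` and `concentrated_loglik` and the random stand-in residuals are
assumptions made for illustration.

```python
import numpy as np

def loglik(U, Sigma):
    """Loglikelihood (12.33) as a function of the n x g residual matrix U."""
    n, g = U.shape
    _, logdet = np.linalg.slogdet(Sigma)
    # By (12.40), the quadratic form equals Tr(Sigma^{-1} U'U).
    quad = np.trace(np.linalg.solve(Sigma, U.T @ U))
    return -0.5 * g * n * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

def concentrated_loglik(U):
    """Concentrated loglikelihood (12.41)."""
    n, g = U.shape
    _, logdet = np.linalg.slogdet(U.T @ U / n)
    return -0.5 * g * n * (np.log(2 * np.pi) + 1.0) - 0.5 * n * logdet

rng = np.random.default_rng(1)
U = rng.normal(size=(50, 3))            # stand-in residual matrix, g = 3
Sigma_hat = U.T @ U / U.shape[0]        # the maximizing Sigma for these residuals
```

Evaluating `loglik(U, Sigma_hat)` reproduces `concentrated_loglik(U)`, and any
other positive definite choice of `Sigma` gives a smaller value.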


It is quite possible to minimize the determinant in (12.41) with respect to $\beta_{\bullet}$
directly. It may or may not be numerically simpler to do so than to solve the 
coupled equations (12.37) and (12.36). 


We saw in Section 3.6 that the squared residuals of a univariate regression 
model tend to be smaller than the squared error terms, because least squares 
estimates make the sum of squared residuals as small as possible. For a similar 
reason, the residuals from ML estimation of a multivariate regression model 
tend to be too small and too highly correlated with each other. We observe 
both effects, because the determinant of $\hat{\Sigma}$ can be made smaller either by
reducing the sums of squared residuals associated with the individual
equations or by increasing the correlations among the residuals. This is likely to
be most noticeable when $g$ and/or the $k_i$ are large relative to $n$.


Although feasible GLS and ML with the assumption of normally distributed 
errors are by far the most commonly used methods of estimating linear SUR 
systems, they are by no means the only ones that have been proposed. For 
fuller treatments, a classic reference on linear SUR systems is Srivastava and 
Giles (1987), and a useful recent survey paper is Fiebig (2001). 


12.3 Systems of Nonlinear Regressions


Many multivariate regression models are nonlinear. For example, economists 
routinely estimate demand systems, in which the shares of consumer expen- 
diture on various classes of goods and services are explained by incomes, 
prices, and perhaps other explanatory variables. Demand systems may be 
estimated using aggregate time-series data, cross-section data, or mixed
time-series/cross-section (panel) data on households.¹


The multivariate nonlinear regression model is a system of nonlinear regres- 
sions which can be written as 


$$y_{ti} = x_{ti}(\beta) + u_{ti}, \quad t = 1, \ldots, n, \quad i = 1, \ldots, g. \tag{12.42}$$

Here $y_{ti}$ is the $t^{\text{th}}$ observation on the $i^{\text{th}}$ dependent variable, $x_{ti}(\beta)$ is the $t^{\text{th}}$
observation on the regression function which determines the conditional mean
of that dependent variable, $\beta$ is a $k$-vector of parameters to be estimated,
and $u_{ti}$ is an error term which is assumed to have mean zero conditional
on all the explanatory variables that implicitly appear in all the regression
functions $x_{tj}(\beta)$, $j = 1, \ldots, g$. In the demand system case, $y_{ti}$ would be the
share of expenditure on commodity $i$ for observation $t$, and the explanatory
variables would include prices and income. We assume that the error terms
in (12.42), like those in (12.01), satisfy assumption (12.02). They are serially
uncorrelated, homoskedastic within each equation, and have contemporaneous
covariance matrix $\Sigma$ with typical element $\sigma_{ij}$.


The equations of the system (12.42) can also be written using essentially 
the same notation as we used for univariate nonlinear regression models in 
Chapter 6. If, for each $i = 1, \ldots, g$, the $n$-vectors $y_i$, $x_i(\beta)$, and $u_i$ are
defined to have typical elements $y_{ti}$, $x_{ti}(\beta)$, and $u_{ti}$, respectively, then the
entire system can be expressed as


$$y_i = x_i(\beta) + u_i, \quad \mathrm{E}(u_i u_j^{\top}) = \sigma_{ij} I_n, \quad i, j = 1, \ldots, g. \tag{12.43}$$


We have written (12.42) and (12.43) in such a way that there is just a single 
vector of parameters, denoted $\beta$. Every individual parameter may, at least in
principle, appear in every equation, although that is rare in practice. In the 
demand systems case, however, some but not all of the parameters typically do 
appear in every equation of the system. Thus systems of nonlinear regressions 
very often involve cross-equation restrictions. 


Multivariate nonlinear regression models can be estimated in essentially the 
same way as the multivariate linear regression model (12.01). Feasible GLS 


1 The literature on demand systems is vast; see, among many others, Christensen, 
Jorgenson, and Lau (1975), Barten (1977), Deaton and Muellbauer (1980), 
Pollak and Wales (1981, 1987), Browning and Meghir (1991), Lewbel (1991), 
and Blundell, Browning, and Meghir (1994). 
and maximum likelihood are both commonly used. The results we obtained
in the previous section still apply, provided they are modified to allow for the 
nonlinearity of the regression functions and for cross-equation restrictions. 
Our discussion will therefore be quite brief. 


Estimation 


We saw in Section 7.3 that nonlinear GLS estimates can be obtained either by 
minimizing the criterion function (7.13) or, equivalently, by solving the set of 
first-order conditions (7.14). For the multivariate nonlinear regression model 
(12.42), the criterion function can be written so that it looks very much like 
expression (12.11). Let $y_{\bullet}$ once again denote a $gn$-vector of the $y_i$ stacked
vertically, and let $x_{\bullet}(\beta)$ denote a $gn$-vector of the $x_i(\beta)$ stacked in the same
way. The criterion function (7.13) then becomes


$$(y_{\bullet} - x_{\bullet}(\beta))^{\top}(\Sigma^{-1} \otimes I_n)(y_{\bullet} - x_{\bullet}(\beta)). \tag{12.44}$$


Minimizing (12.44) with respect to $\beta$ yields nonlinear GLS estimates which,
by the results of Section 7.2, are consistent and asymptotically efficient under 
standard regularity conditions. 


The first-order conditions for the minimization of (12.44) give rise to the
following moment conditions, which have a very similar form to the moment
conditions (12.12) that we found for the linear case:


$$X_{\bullet}^{\top}(\beta)(\Sigma^{-1} \otimes I_n)(y_{\bullet} - x_{\bullet}(\beta)) = 0. \tag{12.45}$$


Here, the $gn \times k$ matrix $X_{\bullet}(\beta)$ is a matrix of partial derivatives of the $x_{ti}(\beta)$.
If the $n \times k$ matrices $X_i(\beta)$ are defined, just as in the univariate case, so
that the $tj^{\text{th}}$ element of $X_i(\beta)$ is $\partial x_{ti}(\beta)/\partial\beta_j$, for $t = 1, \ldots, n$, $j = 1, \ldots, k$,
then $X_{\bullet}(\beta)$ is the matrix formed by stacking the $X_i(\beta)$ vertically. Except in
the special case in which each parameter appears in only one equation of the
system, $X_{\bullet}(\beta)$ does not have the block-diagonal structure of $X_{\bullet}$ in (12.05).


Despite this fact, it is not hard to show that the moment conditions (12.45) 
can be expressed in a compact form like (12.16), but with a double sum. As 
readers are asked to check in Exercise 12.12, we obtain estimating equations 
of the form 


$$\sum_{i=1}^{g}\sum_{j=1}^{g} \sigma^{ij}\,X_i^{\top}(\beta)\,(y_j - x_j(\beta)) = 0. \tag{12.46}$$


The vector $\hat{\beta}^{\mathrm{GLS}}$ that solves these equations is the nonlinear GLS estimator.


Adapting expression (7.05) to the model (12.43) gives the standard estimate 
of the covariance matrix of the nonlinear GLS estimator, namely, 


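$$\widehat{\operatorname{Var}}(\hat{\beta}^{\mathrm{GLS}}) = \bigl(X_{\bullet}^{\top}(\hat{\beta}^{\mathrm{GLS}})(\Sigma^{-1} \otimes I_n)X_{\bullet}(\hat{\beta}^{\mathrm{GLS}})\bigr)^{-1}. \tag{12.47}$$

This can also be written (see Exercise 12.12 again) as


$$\widehat{\operatorname{Var}}(\hat{\beta}^{\mathrm{GLS}}) = \Bigl(\sum_{i=1}^{g}\sum_{j=1}^{g} \sigma^{ij}\,X_i^{\top}(\hat{\beta}^{\mathrm{GLS}})\,X_j(\hat{\beta}^{\mathrm{GLS}})\Bigr)^{-1}. \tag{12.48}$$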


Feasible GLS estimation works in essentially the same way for nonlinear mul- 
tivariate regression models as it does for linear ones. The individual equations 
of the system are first estimated separately by either ordinary or nonlinear 
least squares, as appropriate. The residuals are then grouped into an n x g 
matrix $\hat{U}$, and equation (12.17) is used to obtain the estimate $\hat{\Sigma}$. We can
then replace $\Sigma$ by $\hat{\Sigma}$ in the GLS criterion function (12.44) or in the moment
conditions (12.45) to obtain the feasible GLS estimator $\hat{\beta}^{\mathrm{F}}$. We may also
use a continuously updated estimator, alternately updating our estimates of
$\beta$ and $\Sigma$. If this iterated feasible GLS procedure converges, we will have
obtained ML estimates, although there may well be more computationally 
attractive ways to do so. 


Maximum likelihood estimation under the assumption of normality is very 
popular for multivariate nonlinear regression models. For the system (12.42), 
the loglikelihood function can be written as 


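$$-\frac{gn}{2}\,\log 2\pi - \frac{n}{2}\,\log|\Sigma| - \frac{1}{2}\,(y_{\bullet} - x_{\bullet}(\beta))^{\top}(\Sigma^{-1} \otimes I_n)(y_{\bullet} - x_{\bullet}(\beta)). \tag{12.49}$$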


This is the analog of the loglikelihood function (12.33) for the linear case. 
Maximizing (12.49) with respect to $\beta$ for given $\Sigma$ is equivalent to minimizing
the criterion function (12.44) with respect to $\beta$, and so the first-order
conditions are equations (12.45). Maximizing (12.49) with respect to $\Sigma$ for given $\beta$
leads to first-order conditions that can be written as


$$\hat{\Sigma}(\beta) = \frac{1}{n}\,U^{\top}(\beta)\,U(\beta),$$


in exactly the same way as the maximization of (12.33) with respect to $\Sigma$
led to equation (12.36). Here the $n \times g$ matrix $U(\beta)$ is defined so that its
$i^{\text{th}}$ column is $y_i - x_i(\beta)$.


Thus the estimating equations that define the ML estimator are 


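$$\begin{aligned}
X_{\bullet}^{\top}(\hat{\beta}^{\mathrm{ML}})(\hat{\Sigma}_{\mathrm{ML}}^{-1} \otimes I_n)(y_{\bullet} - x_{\bullet}(\hat{\beta}^{\mathrm{ML}})) &= 0, \text{ and} \\
\hat{\Sigma}_{\mathrm{ML}} &= \frac{1}{n}\,U^{\top}(\hat{\beta}^{\mathrm{ML}})\,U(\hat{\beta}^{\mathrm{ML}}).
\end{aligned} \tag{12.50}$$

As in the linear case, these are also the estimating equations for the
continuously updated GMM estimator. The covariance matrix of $\hat{\beta}^{\mathrm{ML}}$ is, of course,
given by either of the formulas (12.47) or (12.48) evaluated at $\hat{\beta}^{\mathrm{ML}}$ and $\hat{\Sigma}_{\mathrm{ML}}$.
The loglikelihood function concentrated with respect to $\Sigma$ can be written,
just like expression (12.41), as


$$-\frac{gn}{2}\,(\log 2\pi + 1) - \frac{n}{2}\,\log\Bigl|\frac{1}{n}\,U^{\top}(\beta)\,U(\beta)\Bigr|. \tag{12.51}$$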


As in the linear case, it may or may not be numerically easier to maximize the 
concentrated function directly than to solve the estimating equations (12.50). 


The Gauss-Newton Regression 


The Gauss-Newton regression can be very useful in the context of multivariate 
regression models, both linear and nonlinear. The starting point for setting 
up the GNR for both types of multivariate model is equation (7.15), the GNR 
for the standard univariate model $y = x(\beta) + u$, with $\operatorname{Var}(u) = \Omega$. This
GNR takes the form


$$\Psi^{\top}(y - x(\beta)) = \Psi^{\top}X(\beta)\,b + \text{residuals},$$


where, as usual, $X(\beta)$ is the matrix of partial derivatives of the regression
functions, and $\Psi$ is such that $\Psi\Psi^{\top} = \Omega^{-1}$.


Expressed as a univariate regression, the multivariate model (12.43) becomes 
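
$$y_{\bullet} = x_{\bullet}(\beta) + u_{\bullet}, \quad \operatorname{Var}(u_{\bullet}) = \Sigma \otimes I_n. \tag{12.52}$$

If we now define the $g \times g$ matrix $\Psi$ such that $\Psi\Psi^{\top} = \Sigma^{-1}$, it is clear that

$$(\Psi \otimes I_n)(\Psi \otimes I_n)^{\top} = (\Psi \otimes I_n)(\Psi^{\top} \otimes I_n) = (\Psi\Psi^{\top} \otimes I_n) = \Sigma^{-1} \otimes I_n,$$


where the last expression is the inverse of the covariance matrix of $u_{\bullet}$.
From (7.15), the GNR corresponding to (12.52) is therefore


$$(\Psi^{\top} \otimes I_n)(y_{\bullet} - x_{\bullet}(\beta)) = (\Psi^{\top} \otimes I_n)X_{\bullet}(\beta)\,b + \text{residuals}. \tag{12.53}$$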


The $gn \times k$ matrix $X_{\bullet}(\beta)$ is the matrix of partial derivatives that we already
defined for use in the moment conditions (12.45). Observe that, as required 
for a properly defined artificial regression, the inner product of the regressand 
with the matrix of regressors yields the left-hand side of the moment condi- 
tions (12.45), and the inverse of the inner product of the regressor matrix with 
itself has the same form as the covariance matrix (12.47). 


The Gauss-Newton regression (12.53) can be useful in a number of contexts. 
It provides a convenient way to solve the estimating equations (12.45) in 
order to obtain an estimate of $\beta$ for given $\Sigma$, and it automatically computes
the covariance matrix estimate (12.47) as well. Because feasible GLS and
ML estimation are algebraically identical as regards the estimation of the
parameter vector $\beta$, the GNR is useful in both contexts. In practice, it is
frequently used to calculate test statistics for restrictions on $\beta$; see Section 6.7.
Another important use is to impose cross-equation restrictions after equation- 
by-equation estimation. For this purpose, the multivariate GNR is just as 
useful for linear systems as for nonlinear ones; see Exercise 12.13. 
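
One way to see the GNR (12.53) at work is on a small linear system with a
cross-equation restriction, where a single step from any starting value gives
the GLS estimate. The code below is a hedged sketch: `gnr_step` is a
hypothetical helper, the Cholesky factor is just one admissible choice of the
matrix Psi, and the data are simulated for the illustration.

```python
import numpy as np

def gnr_step(y_s, x_s, X_s, Sigma, n):
    """Run the GNR (12.53) once; return b and the inverse regressor cross-product."""
    Psi = np.linalg.cholesky(np.linalg.inv(Sigma))   # lower triangular, Psi Psi' = Sigma^{-1}
    T = np.kron(Psi.T, np.eye(n))                     # Psi' (x) I_n
    regressand = T @ (y_s - x_s)
    regressors = T @ X_s
    b = np.linalg.lstsq(regressors, regressand, rcond=None)[0]
    cov = np.linalg.inv(regressors.T @ regressors)    # same form as (12.47)
    return b, cov

# Two linear equations sharing one parameter vector (a cross-equation restriction).
rng = np.random.default_rng(2)
n = 60
Sigma = np.array([[1.0, 0.5], [0.5, 1.5]])
Z1, Z2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
u = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
beta_true = np.array([0.7, -1.2])
y_s = np.concatenate([Z1 @ beta_true + u[:, 0], Z2 @ beta_true + u[:, 1]])
X_s = np.vstack([Z1, Z2])            # gn x k matrix of derivatives, not block-diagonal
beta_start = np.zeros(2)
b, cov = gnr_step(y_s, X_s @ beta_start, X_s, Sigma, n)
beta_gls = beta_start + b            # exact after one step because x(beta) is linear
```

For a genuinely nonlinear system the step would be repeated, re-evaluating the
regression functions and their derivatives at each updated parameter vector.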


12.4 Linear Simultaneous Equations Models


In Chapter 8, we dealt with instrumental variables estimation of a single 
equation in which some of the explanatory variables are endogenous. As we 
noted there, it is necessary to have information about the data-generating 
process for all of the endogenous variables in order to determine the optimal 
instruments. However, we actually dealt with only one equation, or at least 
only one equation at a time. The model that we consider in this section and 
the next, namely, the linear simultaneous equations model, extends what we 
did in Chapter 8 to a model in which all of the endogenous variables have the 
same status. Our objective is to obtain efficient estimates of the full set of 
parameters that appear in all of the simultaneous equations. 


The Model 


The $i^{\text{th}}$ equation of a linear simultaneous system can be written as

$$y_i = X_i\beta_i + u_i = Z_i\beta_{1i} + Y_i\beta_{2i} + u_i, \tag{12.54}$$


where $X_i$ is an $n \times k_i$ matrix of explanatory variables that can be partitioned
as $X_i = [Z_i \;\; Y_i]$. Here $Z_i$ is an $n \times k_{1i}$ matrix of variables that are assumed
to be exogenous or predetermined, and $Y_i$ is an $n \times k_{2i}$ matrix of endogenous
variables, with $k_{1i} + k_{2i} = k_i$. The $k_i$-vector $\beta_i$ of parameters can be
partitioned as $[\beta_{1i} \,\vdots\, \beta_{2i}]$ to conform with the partitioning of $X_i$. The $g$ endogenous
variables $y_1$ through $y_g$ are assumed to be jointly generated by $g$ equations of
the form (12.54). The number of exogenous or predetermined variables that
appear anywhere in the system is $l$. This implies that $k_{1i} \le l$ for all $i$.²


We make the standard assumption (12.02) about the error terms. Thus we 
allow for contemporaneous correlation, but not for heteroskedasticity or serial 
correlation. It is, of course, quite possible to allow for these extra
complications, but they are not admitted in the context of the model currently
under discussion, which thus has a distinctly classical flavor, as befits a model 
that has inspired a long and distinguished literature. 


Except for the explicit distinction between endogenous and predetermined ex- 
planatory variables, equation (12.54) looks very much like the typical equation 
(12.01) of an SUR system. However, there is one important difference, which 
is concealed by the notation. It is that, as with the simple demand-supply 
model of Section 8.2, the dependent variables $y_i$ are not necessarily distinct.


2 Readers should be warned that the notation we have introduced in equation 
(12.54) is not universal. In particular, some authors reverse the definitions of 
$X_i$ and $Z_i$ and then define $X$ to be the $n \times l$ matrix of all the exogenous
and predetermined variables, which we will denote below by W. Our notation 
emphasizes the similarities between the linear simultaneous equations model 
(12.54) and the linear SUR system (12.01), as well as making it clear that W 
plays the role of a matrix of instruments. 
Since equations (12.54) form a simultaneous system, it is arbitrary which one
of the endogenous variables is put on the left-hand side with a coefficient of 1,
at least in any equation in which more than one endogenous variable appears.
It is a matter of simple algebra to select one of the variables in the matrix $Y_i$,
take it over to the left-hand side while taking $y_i$ over to the right, and then
rescale the coefficients so that the selected variable has a coefficient of 1. This 
point can be important in practice. 


Just as we did with the linear SUR model, we can convert the system of 
equations (12.54) to a single equation by stacking them vertically. As before, 
the $gn$-vectors $y_{\bullet}$ and $u_{\bullet}$ consist of the $y_i$ and the $u_i$, respectively, stacked
vertically. The $gn \times k$ matrix $X_{\bullet}$, where $k = k_1 + \ldots + k_g$, is defined to be
a block-diagonal matrix with diagonal blocks $X_i$, just as in equation (12.05).
The full system can then be written as


$$y_{\bullet} = X_{\bullet}\beta_{\bullet} + u_{\bullet}, \quad \mathrm{E}(u_{\bullet}u_{\bullet}^{\top}) = \Sigma \otimes I_n, \tag{12.55}$$


where the $k$-vector $\beta_{\bullet}$ is formed by stacking the $\beta_i$ vertically. As before, the
$g \times g$ matrix $\Sigma$ is the contemporaneous covariance matrix of the error terms.
The true value of $\beta_{\bullet}$ will be denoted $\beta_{\bullet}^{0}$.


Efficient GMM Estimation 


One of the main reasons for estimating a full system of equations is to obtain 
an efficiency gain relative to single-equation estimation. In Section 9.2, we 
saw how to obtain the most efficient possible estimator for a single equation in 
the context of efficient GMM estimation. The theoretical moment conditions 
that lead to such an estimator are given in equation (9.18), which we rewrite 
here for easy reference: 


$$\mathrm{E}\bigl(\bar{X}^{\top}\Omega^{-1}(y - X\beta)\bigr) = 0. \tag{9.18}$$


Because we are assuming that there is no serial correlation, these moment 
conditions are also valid for the linear simultaneous equations model (12.54). 
We simply need to reinterpret them in terms of that model. 


In reinterpreting the moment conditions (9.18), it is clear that $y_{\bullet}$ will replace
the vector $y$, $X_{\bullet}\beta_{\bullet}$ will replace the vector $X\beta$, and $\Sigma^{-1} \otimes I_n$ will replace the
matrix $\Omega^{-1}$. What is not quite so clear is what will replace $\bar{X}$. Recall that $\bar{X}$
in (9.18) is the matrix defined row by row so as to contain the expectations of
the explanatory variables for each observation conditional on the information
that is predetermined for that observation. We need to obtain the matrix that
corresponds to $\bar{X}$ in equation (9.18) for the model (12.55).


Let $W$ denote an $n \times l$ matrix of exogenous and predetermined variables, the
columns of which are all of the linearly independent columns of the $Z_i$. For
these variables, the expectations conditional on predetermined information 
are just the variables themselves. Thus we only need worry about the endo- 
genous explanatory variables. Because their joint DGP is given by the system 
of linear equations (12.54), it must be possible to solve these equations for
the endogenous variables as functions of the predetermined variables and the 
error terms. Since these equations are linear and have the same form for all 
observations, the solution must have the form 


$$y_i = W\pi_i + \text{error terms}, \tag{12.56}$$


where $\pi_i$ is an $l$-vector of parameters that are, in general, nonlinear functions
of the parameters $\beta_{\bullet}$. As the notation indicates, the variables contained in
the matrix $W$ serve as instrumental variables for the estimation of the model
parameters. Later, we will investigate more fully the nature of the $\pi_i$. We
pay little attention to the error terms, because our objective is to compute
the conditional expectations of the elements of the $y_i$, and we know that each
of the error terms must have expectation 0 conditional on all the exogenous 
and predetermined variables. 


The vector of conditional expectations of the elements of $y_i$ is just $W\pi_i$. Since
equations (12.56) take the form of linear regressions with exogenous and
predetermined explanatory variables, OLS estimates of the $\pi_i$ are consistent. As
we saw in Section 12.2, they are also efficient, even though the error terms
will generally display contemporaneous correlation, because the same
regressors appear in every equation. Thus we can replace the unknown $\pi_i$ by their
OLS estimates based on equations (12.56). This means that the conditional
expectations of the vectors $y_i$ are estimated by the OLS fitted values, that
is, the vectors $W\hat{\pi}_i = P_W y_i$. When this is done, the matrices that contain
the estimates of the conditional expectations of the elements of the $X_i$ can be
written as


$$\hat{X}_i \equiv [Z_i \;\; P_W Y_i] = P_W [Z_i \;\; Y_i] = P_W X_i. \tag{12.57}$$


We write $\hat{X}_i$ rather than $\bar{X}_i$ because the unknown conditional expectations
are estimated. The step from the second to the third expression in (12.57) is
possible because all the columns of all the $Z_i$ are, by construction, contained
in the span of the columns of W. 
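
The equality in (12.57) is easy to confirm on artificial data. In this small
sketch the helper `projection` and all of the generated matrices are
assumptions for illustration; the only structural feature that matters is that
the columns of the included exogenous block are columns of the instrument
matrix.

```python
import numpy as np

def projection(W):
    """Orthogonal projection matrix P_W = W (W'W)^{-1} W'."""
    return W @ np.linalg.solve(W.T @ W, W.T)

rng = np.random.default_rng(3)
n = 40
W = rng.normal(size=(n, 4))         # all exogenous or predetermined variables (l = 4)
Z_i = W[:, :2]                      # the two of them that are included in equation i
Y_i = rng.normal(size=(n, 1))       # stand-in for an endogenous regressor
X_i = np.column_stack([Z_i, Y_i])
P_W = projection(W)
X_hat_i = P_W @ X_i                 # right-hand expression in (12.57)
```

Here `X_hat_i` coincides with the matrix whose columns are `Z_i` followed by
`P_W @ Y_i`, because projecting `Z_i` on `W` leaves it unchanged.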


We are now ready to construct the matrix to be used in place of $\bar{X}$ in (9.18).
It is the block-diagonal $gn \times k$ matrix $\hat{X}_{\bullet}$, with diagonal blocks the $\hat{X}_i$. This
allows us to write the estimating equations for efficient GMM estimation as


$$\hat{X}_{\bullet}^{\top}(\Sigma^{-1} \otimes I_n)(y_{\bullet} - X_{\bullet}\beta_{\bullet}) = 0. \tag{12.58}$$


These equations, which are the empirical versions of the theoretical moment 
conditions (9.18), can be rewritten in several other ways. In particular, they 
can be written in the form 
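
$$\begin{bmatrix}
\sigma^{11} X_1^{\top} P_W & \cdots & \sigma^{1g} X_1^{\top} P_W \\
\vdots & \ddots & \vdots \\
\sigma^{g1} X_g^{\top} P_W & \cdots & \sigma^{gg} X_g^{\top} P_W
\end{bmatrix}
\begin{bmatrix}
y_1 - X_1\beta_1 \\ \vdots \\ y_g - X_g\beta_g
\end{bmatrix} = 0,$$

by analogy with equation (12.13), and in the form


$$\sum_{j=1}^{g} \sigma^{ij}\,X_i^{\top} P_W (y_j - X_j\beta_j) = 0, \quad i = 1, \ldots, g, \tag{12.59}$$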


by analogy with equation (12.16). It is also straightforward to check (see 
Exercise 12.14) that they can be written as 


$$X_{\bullet}^{\top}(\Sigma^{-1} \otimes P_W)(y_{\bullet} - X_{\bullet}\beta_{\bullet}) = 0, \tag{12.60}$$


from which it follows immediately that equations (12.58) are equivalent to the 
first-order conditions for the minimization of the criterion function 


$$(y_{\bullet} - X_{\bullet}\beta_{\bullet})^{\top}(\Sigma^{-1} \otimes P_W)(y_{\bullet} - X_{\bullet}\beta_{\bullet}). \tag{12.61}$$


The efficient GMM estimator $\hat{\beta}_{\bullet}^{\mathrm{GMM}}$ defined by (12.60) is the analog for a
linear simultaneous equations system of the GLS estimator (12.09) for an 
SUR system. 
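
Solving (12.60) for the parameter vector is a linear problem once the
covariance matrix is taken as given. The following minimal sketch does so with
NumPy and can be used to confirm that the solution also satisfies (12.58).
The names `projection`, `block_diagonal`, and `system_gmm`, the weighting
matrix `Sigma`, and the simulated reduced-form data are assumptions made for
the example; no particular structural parameter values are implied.

```python
import numpy as np

def projection(W):
    """Orthogonal projection onto the span of the columns of W."""
    return W @ np.linalg.solve(W.T @ W, W.T)

def block_diagonal(blocks):
    """Place the matrices in `blocks` along the diagonal of a zero matrix."""
    out = np.zeros((sum(b.shape[0] for b in blocks),
                    sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r, c = r + b.shape[0], c + b.shape[1]
    return out

def system_gmm(ys, Xs, W, Sigma):
    """Solve the estimating equations (12.60) for beta, treating Sigma as known."""
    y_s, X_s = np.concatenate(ys), block_diagonal(Xs)
    M = np.kron(np.linalg.inv(Sigma), projection(W))      # Sigma^{-1} (x) P_W
    return np.linalg.solve(X_s.T @ M @ X_s, X_s.T @ M @ y_s)

# Illustrative data: two endogenous variables generated from a linear reduced form.
rng = np.random.default_rng(4)
n = 80
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])          # l = 4
Pi = np.array([[1.0, 0.5], [-1.0, 1.0], [0.8, -0.6], [0.3, 1.2]])
Y = W @ Pi + rng.normal(size=(n, 2))
ys = [Y[:, 0], Y[:, 1]]
Xs = [np.column_stack([W[:, :2], Y[:, 1]]),    # equation 1: k_11 = 2, k_21 = 1
      np.column_stack([W[:, 2:], Y[:, 0]])]    # equation 2: k_12 = 2, k_22 = 1
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])     # taken as given for the illustration
beta_gmm = system_gmm(ys, Xs, W, Sigma)
```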


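The asymptotic covariance matrix of $\hat{\beta}_{\bullet}^{\mathrm{GMM}}$ can readily be obtained from
expression (9.29). In the notation of (12.58), we find that


$$\operatorname{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{1/2}(\hat{\beta}_{\bullet}^{\mathrm{GMM}} - \beta_{\bullet}^{0})\Bigr) = \operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}\,\hat{X}_{\bullet}^{\top}(\Sigma^{-1} \otimes I_n)\hat{X}_{\bullet}\Bigr)^{-1}. \tag{12.62}$$


This covariance matrix can also be written, in the notation of (12.60), as


$$\operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n}\,X_{\bullet}^{\top}(\Sigma^{-1} \otimes P_W)X_{\bullet}\Bigr)^{-1}. \tag{12.63}$$


Of course, the estimator $\hat{\beta}_{\bullet}^{\mathrm{GMM}}$ is not feasible if, as is almost always the case,
the matrix $\Sigma$ is unknown. However, it is obvious that we can deal with this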
problem by using a procedure analogous to feasible GLS estimation of an SUR 
system. We will return to this issue at the end of this section. 


Two Special Cases 


If the matrix $\Sigma$ is diagonal, then equations (12.59) simplify to

$$\sigma^{ii}\,X_i^{\top} P_W (y_i - X_i\beta_i) = 0, \quad i = 1, \ldots, g. \tag{12.64}$$


The factors of $\sigma^{ii}$ have no influence on the solutions to these equations, which
are therefore just the generalized IV, or 2SLS, estimators for each of the
equations of the system treated individually, with a common matrix $W$ of
instrumental variables. This result is the analog of what we found for an SUR
system with diagonal $\Sigma$. Here it is the equation-by-equation IV estimator
that takes the place of the equation-by-equation OLS estimator. 
Just as single-equation OLS estimation is consistent but in general inefficient
for an SUR system, so is single-equation IV estimation consistent but in
general inefficient for the linear simultaneous equations model. As readers are
asked to verify in Exercise 12.15, the estimating equations (12.64), without
the factors of $\sigma^{ii}$, can be rewritten for the entire system as


$$X_{\bullet}^{\top}(I_g \otimes P_W)(y_{\bullet} - X_{\bullet}\beta_{\bullet}) = 0. \tag{12.65}$$


In general, solving equations (12.65) yields an inefficient estimator unless the
true contemporaneous covariance matrix $\Sigma$ is diagonal.


There is, however, another case in which the estimating equations (12.65) 
yield an asymptotically efficient estimator. This case is analogous to the case 
of an SUR system with the same explanatory variables in each equation, but 
it takes a rather different form in this context. What we require is that each 
of the equations in the system should be just identified. 


When we say that a single equation is just identified by an IV estimator, part 
of what we mean is that the number of instruments is equal to the number of 
explanatory variables, or, equivalently for a linear regression, to the number 
of parameters. If equation $i$ is just identified, therefore, the two matrices $W$
and $P_W X_i$ have the same dimensions. In fact, they span the same linear
subspace provided that $P_W X_i$ is of full column rank. Consequently, there
exists an $l \times l$ matrix $J_i$ such that $P_W X_i J_i = W$. Premultiplying the $i^{\text{th}}$
equation of (12.59) by $J_i^{\top}$ thus gives


$$\sum_{j=1}^{g} \sigma^{ij}\,W^{\top}(y_j - X_j\beta_j) = 0.$$


If all the equations of a simultaneous equations system are just identified,
then the above relation holds for each $i = 1, \ldots, g$. We can then multiply
equation $i$ by $\sigma_{mi}$ and sum over $i$, as in equation (12.20). This yields the
decoupled estimating equations


$$W^{\top}(y_m - X_m\beta_m) = 0, \quad m = 1, \ldots, g,$$


which define the single-equation (simple) IV estimators in the just-identified 
case. Therefore, as with the SUR model, there is no advantage to system 
estimation rather than equation-by-equation estimation when every equation 
is just identified, because the estimating equations use up all of the available 
moment conditions. 
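
The just-identified result can be illustrated numerically: with as many
instruments as regressors in every equation, solving the system equations
(12.59) with a non-diagonal covariance matrix gives the same numbers as
equation-by-equation simple IV. The sketch below uses hypothetical helper
names (`system_iv`, `simple_iv`) and simulated data chosen only for the
example.

```python
import numpy as np

def system_iv(ys, Xs, W, Sigma):
    """Solve equations (12.59), stacked over i, for the full parameter vector."""
    g = len(ys)
    P = W @ np.linalg.solve(W.T @ W, W.T)
    S_inv = np.linalg.inv(Sigma)
    lhs = np.block([[S_inv[i, j] * (Xs[i].T @ P @ Xs[j]) for j in range(g)]
                    for i in range(g)])
    rhs = np.concatenate([sum(S_inv[i, j] * (Xs[i].T @ P @ ys[j]) for j in range(g))
                          for i in range(g)])
    return np.linalg.solve(lhs, rhs)

def simple_iv(y, X, W):
    """Simple IV estimator defined by W'(y - X beta) = 0 (W and X equally wide)."""
    return np.linalg.solve(W.T @ X, W.T @ y)

rng = np.random.default_rng(5)
n = 100
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])      # l = 3
Pi = np.array([[1.0, -1.0], [0.5, 1.0], [1.5, 0.8]])
Y = W @ Pi + rng.normal(size=(n, 2))
ys = [Y[:, 0], Y[:, 1]]
Xs = [np.column_stack([W[:, :2], Y[:, 1]]),    # k_1 = 3 = l
      np.column_stack([W[:, 1:], Y[:, 0]])]    # k_2 = 3 = l
Sigma = np.array([[1.0, 0.7], [0.7, 2.0]])     # deliberately non-diagonal
beta_system = system_iv(ys, Xs, W, Sigma)
beta_single = np.concatenate([simple_iv(y, X, W) for y, X in zip(ys, Xs)])
```

Up to rounding error, `beta_system` and `beta_single` agree whatever positive
definite `Sigma` is supplied.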


Identification 


In order to be able to solve the estimating equations (12.60) for $\beta_{\bullet}$, it must
be possible to invert the matrix


$$X_{\bullet}^{\top}(\Sigma^{-1} \otimes P_W)X_{\bullet}. \tag{12.66}$$

Thus, in finite samples, the parameters of the model (12.55) will be identified
if this matrix is nonsingular. Although this statement is accurate, it is neither 
complete nor transparent. In particular, even if the matrix (12.66) is singular, 
it may still be possible to identify some of the parameters. 


Whenever the contemporaneous covariance matrix $\Sigma$ is nonsingular, it can
be shown, as spelt out in Exercise 12.16, that the matrix (12.66) is singular if
and only if at least one of the matrices $P_W X_i$ does not have full column rank.
In other words, the system of equations is unidentified if and only if at least
one of its component equations is unidentified. The result of the exercise also
shows that the parameters of those equations for which $P_W X_i$ does have full
column rank can be identified uniquely by the estimating equations (12.60).
In consequence, provided that $\Sigma$ is nonsingular, we can study identification
equation by equation without loss of generality. 


A necessary condition for $P_W X_i$ to have full column rank is that $l$, the number
of instruments contained in the matrix $W$, should be no less than $k_i$, the
number of explanatory variables contained in $X_i$. This condition is called the
order condition for identification of equation $i$. It is an accounting condition,
and, as such, can be expressed in more than one way. Recall that we defined
$k_{1i}$ as the number of exogenous or predetermined explanatory variables in $X_i$,
that is, the column dimension of the matrix $Z_i$. Since the total number of exogenous
or predetermined variables in the full system is $l$, the number of such variables
excluded from equation $i$ is $l - k_{1i}$. The number of endogenous explanatory
variables included in equation $i$ is, by definition, $k_{2i}$, which is the column dimension
of the matrix $Y_i$. Therefore, the inequality $l \ge k_i$ is equivalent to


$$l \ge k_{1i} + k_{2i} \quad \text{or} \quad l - k_{1i} \ge k_{2i}. \tag{12.67}$$


The second inequality here says that the number of predetermined variables 
excluded from an equation must be at least as great as the number of endo- 
genous explanatory variables in that equation. 


The necessary and sufficient condition for the identification of the parameters
of equation $i$ is that $P_W X_i$ should have full column rank of $k_i$. This condition,
which is, not surprisingly, called the rank condition for identification, will
hold whenever the $k_i \times k_i$ matrix $X_i^{\top} P_W X_i$ is nonsingular. It is easy to check
whether the rank condition holds for any given data set. However, it is not so
easy to check whether it holds asymptotically. The problem is that, because
some of the columns of $X_i$ are endogenous, $\operatorname{plim}\, n^{-1} X_i^{\top} P_W X_i$ depends on
the parameters of the DGP. This point is important, and we will discuss it at 
some length below. 
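
For a given data set, both conditions are simple to check. The code below is
an illustrative sketch only: `order_condition` and `rank_condition` are
hypothetical helpers, the numerical tolerance is an arbitrary choice, and the
data are simulated. It addresses the finite-sample rank condition, not the
asymptotic one.

```python
import numpy as np

def order_condition(l, k1, k2):
    """Order condition (12.67): excluded predetermined >= included endogenous."""
    return l - k1 >= k2

def rank_condition(W, X_i, tol=1e-8):
    """Finite-sample rank condition: P_W X_i has full column rank k_i."""
    P_W_X = W @ np.linalg.lstsq(W, X_i, rcond=None)[0]    # fitted values P_W X_i
    return np.linalg.matrix_rank(P_W_X, tol=tol) == X_i.shape[1]

rng = np.random.default_rng(7)
n = 50
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])           # l = 3
Pi = np.array([[1.0, 0.5], [0.4, -1.0], [0.9, 0.3]])
Y = W @ Pi + rng.normal(size=(n, 2))
X_ok = np.column_stack([W[:, :2], Y[:, 0]])     # k_1i = 2, k_2i = 1: just identified
X_bad = np.column_stack([W[:, :2], Y])          # k_1i = 2, k_2i = 2: too few instruments
```

With these matrices, `X_ok` passes both checks, whereas `X_bad` fails the order
condition and therefore also the rank condition, since four columns cannot be
linearly independent within the three-dimensional span of `W`.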


Structural and Reduced Forms 


When the equations of a linear simultaneous equations model are written in 
the form (12.54), it is normally the case that each equation will have a direct 
economic interpretation. In the model of Section 8.2, for instance, the two 
equations are intended to correspond to demand and supply functions. It is
for this reason that these are called structural equations. The full system of 
equations constitutes what is called the structural form of the model. 


It will be convenient for our subsequent analysis to stack the equations (12.54) 
horizontally, instead of vertically as in the system (12.55). We thus define the 
$n \times g$ matrix $Y$ as $[y_1 \;\; y_2 \;\; \cdots \;\; y_g]$. Similarly, the vectors $u_i$ of error terms
can be stacked side by side to form the $n \times g$ matrix $U$. In this notation, the
entire set of equations (12.54) can be represented as


$$Y\Gamma = WB + U, \tag{12.68}$$


where the $g \times g$ matrix $\Gamma$ and the $l \times g$ matrix $B$ are defined in such a way
as to make (12.68) equivalent to (12.54). Each equation of the system (12.54)
contributes one column to (12.68). This can be seen by writing equation $i$
of (12.54) in the form


$$[y_i \;\; Y_i]\begin{bmatrix} 1 \\ -\beta_{2i} \end{bmatrix} = Z_i\beta_{1i} + u_i. \tag{12.69}$$


All of the columns of $Y_i$ are also columns of $Y$, as is $y_i$ itself, and so column $i$
of the matrix $\Gamma$ has 1 for element $i$, and the elements of the vector $-\beta_{2i}$ for
the other nonzero elements. The endogenous variables that are excluded from
equation $i$ contribute zero elements to the column. Similarly, all the columns
of $Z_i$ are also columns of $W$, and so the nonzero elements of column $i$ of $B$
are the elements of $\beta_{1i}$, in appropriate positions. The “structure” of the
structural equations is embodied in the structure of the matrices $\Gamma$ and $B$.


If (12.68) is to represent a model by which the $g$ endogenous variables are
generated, it is necessary for $\Gamma$ to be nonsingular. We can thus postmultiply
both sides of equation (12.68) by $\Gamma^{-1}$ to obtain


$$Y = WB\Gamma^{-1} + V, \tag{12.70}$$


where $V \equiv U\Gamma^{-1}$. The representation (12.70) is called the reduced form of the
model, and its component equations (the columns of the matrix equation) are 
the reduced form equations. These reduced form equations are regressions, 
which in general are nonlinear in the parameters. Because they have only 
exogenous or predetermined regressors, they can be estimated consistently by 
nonlinear least squares. 
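
The passage from the structural form (12.68) to the reduced form (12.70) can
be made concrete with a two-equation example. Everything numerical in the
sketch below is hypothetical: the coefficient values, the exclusion pattern
(one predetermined variable excluded from each equation), and the simulated
data are assumptions chosen only to show how the two matrices are laid out.

```python
import numpy as np

beta_21, beta_22 = 0.5, -0.8       # hypothetical coefficients on the endogenous regressors
Gamma = np.array([[1.0, -beta_22],
                  [-beta_21, 1.0]])     # column i: 1 for y_i, minus beta_2i elsewhere
B = np.array([[1.0, 2.0],               # the constant appears in both equations
              [0.7, 0.0],               # w_2 appears only in equation 1
              [0.0, -1.5]])             # w_3 appears only in equation 2

rng = np.random.default_rng(6)
n = 30
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
U = rng.normal(size=(n, 2))

Gamma_inv = np.linalg.inv(Gamma)
Pi = B @ Gamma_inv                  # restricted reduced-form coefficients, as in (12.70)
V = U @ Gamma_inv
Y = W @ Pi + V                      # the reduced form generates the endogenous variables
```

Every element of `Pi` is nonzero here even though `B` contains zeros, which
illustrates why the reduced form coefficients are nonlinear functions of the
structural parameters.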


Unless all the equations of the system are just identified, (12.70) is in fact 
what is called the restricted reduced form or RRF. This is in contrast to the 
unrestricted reduced form, or URF, which can be written as 


$$Y = W\Pi + V, \tag{12.71}$$


where $\Pi$ is an unrestricted $l \times g$ matrix. Notice that equation (12.56)
is simply the $i^{\rm th}$ equation of this system, with $y_i$ the $i^{\rm th}$
column of the matrix $Y$ and $\pi_i$ the $i^{\rm th}$ column of the matrix $\Pi$.


It may at first sight seem odd to refer to (12.70) as the restricted reduced form
and to (12.71) as the unrestricted one. The URF (12.71) has $gl$ regression
coefficients, since $\Pi$ is an $l \times g$ matrix, while the RRF (12.70) appears
to have $gl + g^2$ parameters, since $B$ is $l \times g$ and $\Gamma$ is
$g \times g$. But remember that $\Gamma$ has $g$ elements which are constrained to
equal 1, and both $\Gamma$ and $B$ have many zero elements corresponding to
excluded endogenous and predetermined explanatory variables, respectively. As
readers are invited to show in Exercise 12.18, if all the equations of the system
are just identified, so that the order condition (12.67) is satisfied with
equality for each $i = 1, \ldots, g$, then there are exactly as many parameters in
the RRF as in the URF. When some of the order conditions are inequalities, there
are fewer parameters in the RRF than in the URF.


Asymptotic Identification 


Whether or not the parameters of a linear simultaneous system are identified 
by a given data set depends only on the order condition and the properties 
of the actual data, but this is not true of asymptotic identification. Since 
the parameters must be asymptotically identified if the parameter estimates 
are to be consistent, it is worth studying in some detail the conditions for 
asymptotic identification in such a system. 


We assume that the probability limit of $n^{-1}W^\top U$ is a zero matrix and that
the $l \times l$ matrix


$$S_{W^\top W} \equiv \operatorname*{plim}_{n\to\infty} \frac{1}{n}\, W^\top W$$


is positive definite and, consequently, nonsingular. The nonsingularity of the
matrix $W^\top W$ is not necessary for identification by a given data set, since,
if there are enough instruments, it is quite possible that each of the matrices
$P_W X_i$, $i = 1, \ldots, g$, should have full column rank even though some of
the instruments are linearly dependent. Similarly, it is not necessary that
$S_{W^\top W}$ should be nonsingular for asymptotic identification. However, since
it is always possible to eliminate linearly dependent instruments, it is
convenient to make the nonsingularity assumption. By doing so, we make it clearer
how asymptotic identification depends on the actual parameter values.


For simplicity of notation, we focus on the asymptotic identification of the
first equation of the system, which can be written as


$$y_1 = Z_1\beta_{11} + Y_1\beta_{21} + u_1. \tag{12.72}$$


Since identification can be treated equation by equation without loss of
generality, and since the ordering of the equations is quite arbitrary, our
results will be perfectly general. The matrix $X_1$ of explanatory variables for
the first equation is $X_1 = [\,Z_1 \;\; Y_1\,]$. Recall that the $n \times l$
matrix $W$ contains all the linearly independent columns of the $Z_i$, and in
particular those of $Z_1$. Let us order the columns of $W$ so that the $k_{11}$
columns of $Z_1$ come first.


The $n \times k_{21}$ matrix $Y_1$ is given by a selection of the columns of the
matrix $Y$.
The first column of $Y$, which corresponds to the equation we are studying,
is not among these, because $y_1$ appears only on the left-hand side of that
equation. However, we can freely reorder the remaining columns of $Y$ so that
the $k_{21}$ columns of $Y_1$ are the columns 2 through $k_{21} + 1$ of $Y$. This
done, we can express the first $k_{21} + 1$ columns of the URF (12.71), in
partitioned form, as


$$[\,y_1 \;\; Y_1\,] = [\,Z_1 \;\; W_1\,]
\begin{bmatrix} \pi_{11} & \Pi_{11} \\ \pi_{21} & \Pi_{21} \end{bmatrix}
+ [\,v_1 \;\; V_1\,], \tag{12.73}$$


where we have introduced some further convenient notation. First, the
$n \times (l - k_{11})$ matrix $W_1$ contains all the columns of $W$ that are not
in $Z_1$. Then, for the ordering that we have chosen for the columns of $Y$ and
$W$, $\pi_{11}$ is the $k_{11} \times 1$ vector of parameters in the first reduced
form equation (that is, the equation that defines $y_1$) associated with the
instruments in the matrix $Z_1$, while the $(l - k_{11}) \times 1$ vector
$\pi_{21}$ contains the parameters of the first reduced form equation associated
with the instruments in $W_1$. Finally, the matrices $\Pi_{11}$ and $\Pi_{21}$
are, respectively, of dimensions $k_{11} \times k_{21}$ and
$(l - k_{11}) \times k_{21}$. They contain the parameters of the reduced form
equations numbered 2 through $k_{21} + 1$ and associated with the instruments in
$Z_1$ and $W_1$, respectively. The matrix $[\,v_1 \;\; V_1\,]$ of error terms is
partitioned in the same way as the left-hand side of (12.73).


We can write the matrix $P_W X_1$ as


$$P_W X_1 = P_W [\,Z_1 \;\; Y_1\,] = [\,Z_1 \;\; P_W Y_1\,], \tag{12.74}$$


because $P_W Z_1 = Z_1$. With the help of (12.73), the second block of the
rightmost expression above becomes


$$P_W Y_1 = [\,Z_1 \;\; W_1\,]
\begin{bmatrix} \Pi_{11} \\ \Pi_{21} \end{bmatrix} + P_W V_1, \tag{12.75}$$


where we again use the fact that $P_W [\,Z_1 \;\; W_1\,] = [\,Z_1 \;\; W_1\,]$,
and $\Pi_{11}$ and $\Pi_{21}$ contain the true parameter values. Reorganizing
equations (12.74) and (12.75) gives


$$P_W X_1 = W
\begin{bmatrix} I_{k_{11}} & \Pi_{11} \\ O & \Pi_{21} \end{bmatrix}
+ [\,O \;\; P_W V_1\,]. \tag{12.76}$$


The necessary and sufficient condition for the asymptotic identification of the
parameters of the first equation is the nonsingularity of the probability limit
as $n \to \infty$ of the matrix $n^{-1}X_1^\top P_W X_1$. It is easy to see from
(12.76) that this limit is


$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\, X_1^\top P_W X_1 =
\begin{bmatrix} I_{k_{11}} & O \\ \Pi_{11}^\top & \Pi_{21}^\top \end{bmatrix}
S_{W^\top W}
\begin{bmatrix} I_{k_{11}} & \Pi_{11} \\ O & \Pi_{21} \end{bmatrix}.$$


In Exercise 12.19, readers are invited to check that everything that depends on
the matrix $V$ does indeed tend to zero in the above limit. Since we assumed
that $S_{W^\top W}$ is positive definite, it follows that equation (12.72) is
asymptotically identified if and only if the matrix


$$\begin{bmatrix} I_{k_{11}} & \Pi_{11} \\ O & \Pi_{21} \end{bmatrix} \tag{12.77}$$


is of full column rank $k_1 = k_{11} + k_{21}$. Because this matrix has $l$ rows,
this is not possible unless $l \ge k_1$, that is, unless the order condition is
satisfied. However, even if the order condition is satisfied, there can perfectly
well exist parameter values for which (12.77) does not have full column rank. The
important conclusion of this analysis is that asymptotic identification of an
equation in a linear simultaneous system depends not only on the properties
of the instrumental variables $W$, but also on the specific parameter values of
the DGP.


In Exercise 12.20, readers are asked to show that the matrix (12.77) has
full column rank if and only if the $(l - k_{11}) \times k_{21}$ submatrix
$\Pi_{21}$ has full column rank. While this is a simple enough condition, it is
expressed in terms of the reduced form parameters, which are usually not subject
to a simple interpretation. It is therefore desirable to have a characterization
of the asymptotic identification condition in terms of the structural parameters.
In Exercise 12.21, notation that is suitable for deriving such a characterization
is proposed, and readers are asked to develop it in Exercise 12.22.
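The rank condition on the matrix (12.77) is easy to check numerically for given reduced form parameters. The sketch below is illustrative only: the function names and the numerical values of the blocks are hypothetical, and the rank is computed with a default numerical tolerance.

```python
import numpy as np

def identification_matrix(Pi11, Pi21):
    """Assemble the matrix (12.77) from its reduced form blocks."""
    k11 = Pi11.shape[0]
    rows_below, _ = Pi21.shape
    top = np.hstack([np.eye(k11), Pi11])
    bottom = np.hstack([np.zeros((rows_below, k11)), Pi21])
    return np.vstack([top, bottom])

def is_asymptotically_identified(Pi11, Pi21):
    """True if (12.77) has full column rank k11 + k21."""
    M = identification_matrix(Pi11, Pi21)
    return bool(np.linalg.matrix_rank(M) == M.shape[1])

# Hypothetical values with k11 = 2, k21 = 2, and l - k11 = 3.
Pi11 = np.array([[0.3, -0.2],
                 [1.0, 0.4]])
Pi21_full_rank = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.5, 0.5]])
Pi21_deficient = np.array([[1.0, 2.0],      # second column is twice the first
                           [0.5, 1.0],
                           [-1.0, -2.0]])

print(is_asymptotically_identified(Pi11, Pi21_full_rank))
print(is_asymptotically_identified(Pi11, Pi21_deficient))
```

In both cases the order condition holds, since there are five instruments and four explanatory variables, but only the first choice of $\Pi_{21}$ delivers full column rank.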


Even when the rank condition for asymptotic identification is not satisfied, the
numerical condition that the matrix (12.66) be nonsingular will be satisfied by
almost all data sets. The failure of asymptotic identification will manifest itself
as the phenomenon of weak instruments that we discussed in Section 8.4. In
such a case, we might be tempted to add additional instruments, such as lags
of the instruments themselves or other predetermined variables that may be
correlated with them. But doing this cannot lead to asymptotic identification,
because it would simply append rows of zeros to the matrix $\Pi$ of reduced
form coefficients, one row for each added instrument, and it is obvious that such
an operation cannot convert a matrix of deficient rank into one of full rank.


A discussion of asymptotic identification that is more detailed than the present 
one, but still reasonably compact, is provided by Davidson and MacKinnon 
(1993, Section 18.3). Much fuller treatments may be found in Fisher (1976) 
and Hsiao (1983). 


Three-Stage Least Squares 


The efficient GMM estimator defined by the estimating equations (12.60) is
not feasible unless $\Sigma$ is known. However, we can compute a feasible GMM
estimator if we can obtain a consistent estimate of $\Sigma$, and this is easy to
do. We first estimate the individual equations of the system by generalized IV,
or two-stage least squares, to use the traditional terminology. This inefficient
equation-by-equation estimator is characterized formally by the estimating
equations (12.65). After computing it, we then use the 2SLS residuals to
compute the matrix $\hat\Sigma_{\rm 2SLS}$, as in (12.17). Using
$\hat\Sigma_{\rm 2SLS}$ in place of $\Sigma$ in equations (12.60) yields the
popular three-stage least squares, or 3SLS, estimator, which was originally
proposed by Zellner and Theil (1962). This estimator can be written as


$$\hat\beta_\bullet^{\rm 3SLS} =
\bigl(X_\bullet^\top(\hat\Sigma_{\rm 2SLS}^{-1} \otimes P_W)X_\bullet\bigr)^{-1}
X_\bullet^\top(\hat\Sigma_{\rm 2SLS}^{-1} \otimes P_W)\,y_\bullet. \tag{12.78}$$


The relationship between this 3SLS estimator and the 2SLS estimator for the
entire system is essentially the same as the relationship between the feasible
GLS estimator (12.18) for an SUR system and the OLS estimator (12.06). As
with (12.18), we may wish to compute the continuously updated version of
the 3SLS estimator (12.78), in which case we iteratively update the estimates
of $\beta_\bullet$ and $\Sigma$ by using equations (12.78) and (12.17),
respectively.
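The three stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the helper names `proj`, `tsls`, and `three_sls` are ours, the two-equation system and its parameter values are hypothetical, and $\hat\Sigma_{\rm 2SLS}$ is taken to be the matrix of averaged cross products of the 2SLS residuals. Rather than forming the $gn \times gn$ Kronecker product in (12.78), the sketch uses the fact that block $(i,j)$ of $X_\bullet^\top(\hat\Sigma^{-1} \otimes P_W)X_\bullet$ is $\hat\sigma^{ij} X_i^\top P_W X_j$.

```python
import numpy as np

def proj(W, A):
    """P_W A: fitted values from regressing the columns of A on W."""
    return W @ np.linalg.lstsq(W, A, rcond=None)[0]

def tsls(y, X, W):
    """Generalized IV (2SLS) for a single equation."""
    PX = proj(W, X)
    return np.linalg.solve(PX.T @ X, PX.T @ y)

def three_sls(ys, Xs, W):
    """Classical 3SLS in the form of (12.78), assembled block by block."""
    g, n = len(ys), len(ys[0])
    b2sls = [tsls(y, X, W) for y, X in zip(ys, Xs)]
    resid = np.column_stack([y - X @ b for y, X, b in zip(ys, Xs, b2sls)])
    S_inv = np.linalg.inv(resid.T @ resid / n)   # inverse of Sigma-hat (2SLS)
    PXs = [proj(W, X) for X in Xs]
    A = np.block([[S_inv[i, j] * (PXs[i].T @ PXs[j]) for j in range(g)]
                  for i in range(g)])
    c = np.concatenate([sum(S_inv[i, j] * (PXs[i].T @ ys[j]) for j in range(g))
                        for i in range(g)])
    cov = np.linalg.inv(A)                       # estimated covariance matrix
    return cov @ c, cov, b2sls

# Hypothetical overidentified system with g = 2 and l = 4.
rng = np.random.default_rng(1)
n = 1000
W = rng.normal(size=(n, 4))
chol = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
U = rng.normal(size=(n, 2)) @ chol.T             # correlated structural errors
Gamma = np.array([[1.0, -0.5], [-0.8, 1.0]])
B = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0], [0.0, -1.0]])
Y = (W @ B + U) @ np.linalg.inv(Gamma)

ys = [Y[:, 0], Y[:, 1]]
Xs = [np.column_stack([W[:, :2], Y[:, 1]]),      # X_1 = [Z_1  Y_1]
      np.column_stack([W[:, 2:], Y[:, 0]])]      # X_2 = [Z_2  Y_2]
beta_3sls, cov_3sls, beta_2sls = three_sls(ys, Xs, W)
true_beta = np.array([1.0, 0.5, 0.8, 2.0, -1.0, 0.5])
print(beta_3sls)
```

Iterating the last two stages, with the residuals recomputed from the current 3SLS estimates, would give the continuously updated version.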


From the results (12.62) and (12.63), it is clear that we can estimate the
covariance matrix of the classical 3SLS estimator (12.78) by


$$\widehat{\rm Var}(\hat\beta_\bullet^{\rm 3SLS}) =
\bigl(X_\bullet^\top(\hat\Sigma_{\rm 2SLS}^{-1} \otimes P_W)X_\bullet\bigr)^{-1}, \tag{12.79}$$


which is analogous to (12.19) for the SUR case. Asymptotically valid inferences
can then be made in the usual way. As with the SUR estimator, we can
perform a Hansen-Sargan test of the overidentifying restrictions by using the
fact that, under the null hypothesis, the criterion function (12.61) evaluated
at $\hat\beta_\bullet^{\rm 3SLS}$ and $\hat\Sigma_{\rm 2SLS}$ is asymptotically
distributed as $\chi^2(gl - k)$. Of course, this will also be true if the
procedure has been iterated one or more times.


12.5 Maximum Likelihood Estimation 


Like the SUR model, the linear simultaneous equations model can be esti- 
mated by maximum likelihood under the assumption that the error terms, 
in addition to satisfying the requirements (12.02), are normally distributed. 
In contrast to the situation with an SUR system, where the ML estimator is 
numerically identical to the continuously updated feasible GLS estimator, the 
ML estimator of a linear simultaneous equations model is, in general, different 
from the continuously updated 3SLS estimator. The ML and 3SLS estimators 
are, however, asymptotically equivalent. 


Because the algebra of ML estimation is quite complicated, we have divided 
our treatment of the subject between this section and a technical appendix, 
which appears at the end of the chapter, just prior to the exercises. All of the 
principal results are stated and discussed in this section, but many of them 
are derived in the appendix.


Full-Information Maximum Likelihood

The maximum likelihood estimator of a linear simultaneous system is called 
the full-information maximum likelihood, or FIML, estimator. It is so called 
because it uses information about all the equations in the system, unlike 
the limited-information maximum likelihood estimator (LIML) that will be 
discussed later in this section. 


The loglikelihood function that must be maximized to obtain the FIML estimator
can be written in several different ways. In terms of the notation used
in equation (12.55), it is


$$-\frac{gn}{2}\log 2\pi - \frac{n}{2}\log|\Sigma| + n\log|\det\Gamma|
- \frac{1}{2}(y_\bullet - X_\bullet\beta_\bullet)^\top(\Sigma^{-1} \otimes I_n)
(y_\bullet - X_\bullet\beta_\bullet). \tag{12.80}$$


This looks very much like the loglikelihood function (12.49) for a multivariate
nonlinear regression model with normally distributed errors. The principal
difference is the third term, $n\log|\det\Gamma|$, which is a Jacobian term. This
term is the logarithm of the absolute value of the Jacobian of the transformation
from $u_\bullet$ to $y_\bullet$. As we will see in the appendix, the loglikelihood
function can also be written without an explicit Jacobian term if we start from
the restricted reduced form (12.70).


Maximizing the loglikelihood function (12.80) with respect to $\Sigma$ is exactly
the same as maximizing the loglikelihood function (12.33) with respect to it.
If we had ML estimates of $\beta_\bullet$, or, equivalently, of $B$ and $\Gamma$,
the ML estimate of $\Sigma$ would be


$$\hat\Sigma_{\rm ML} = \frac{1}{n}
(Y\hat\Gamma_{\rm ML} - W\hat B_{\rm ML})^\top
(Y\hat\Gamma_{\rm ML} - W\hat B_{\rm ML}), \tag{12.81}$$


which is just the sample covariance matrix of the structural-form error terms;
compare equation (12.36).


Recall from (12.54) that the parameter vector $\beta_i$ of equation $i$ contains
both the vector $\beta_{1i}$, which is associated with the predetermined
explanatory variables, and the vector $\beta_{2i}$, which is associated with the
endogenous explanatory variables. As is clear from equation (12.69), the matrix
$B$ is determined by the $\beta_{1i}$ alone, and the matrix $\Gamma$ by the
$\beta_{2i}$ alone. We can obtain the first-order conditions for maximizing the
loglikelihood function (12.80) with respect to the $\beta_{1i}$ in exactly the
same way as we obtained conditions (12.12) from the criterion function (12.11)
for an SUR system. The first-order conditions that we seek can be written as


$$Z_\bullet^\top(\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet\beta_\bullet) = 0, \tag{12.82}$$


where the $gn \times \sum_i k_{1i}$ matrix $Z_\bullet$ is defined, similarly to
$X_\bullet$, as a matrix with diagonal blocks $Z_i$. The number of equations in
(12.82) is $\sum_i k_{1i}$, since there is one equation for each element of the
$\beta_{1i}$.


Since it is rather complicated to work out the first-order conditions for the
maximization of (12.80) with respect to the $\beta_{2i}$, we leave this derivation
to the appendix. These conditions can be expressed as


$$Y_\bullet^\top(B, \Gamma)(\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet\beta_\bullet) = 0, \tag{12.83}$$


where the $gn \times \sum_i k_{2i}$ matrix $Y_\bullet(B, \Gamma)$ is again defined
in terms of diagonal blocks. Block $i$ is the $n \times k_{2i}$ matrix
$Y_i(B, \Gamma)$, which is the submatrix of $WB\Gamma^{-1}$ formed by selecting
the columns that correspond to the columns of the matrix $Y_i$ of included
endogenous explanatory variables in equation $i$. The conditions (12.82) and
(12.83) can be grouped together as


$$X_\bullet^\top(B, \Gamma)(\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet\beta_\bullet) = 0, \tag{12.84}$$


where the $i^{\rm th}$ diagonal block of $X_\bullet(B, \Gamma)$ is the
$n \times k_i$ matrix $[\,Z_i \;\; Y_i(B, \Gamma)\,]$. There are
$k = \sum_i k_{1i} + \sum_i k_{2i}$ equations in (12.84).


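With (12.81) and (12.84), we have assembled all of the first-order conditions
that define the FIML estimator. We write them here as a set of estimating
equations:


$$\begin{aligned}
&X_\bullet^\top(\hat B_{\rm ML}, \hat\Gamma_{\rm ML})
(\hat\Sigma_{\rm ML}^{-1} \otimes I_n)
(y_\bullet - X_\bullet\hat\beta_\bullet^{\rm ML}) = 0, \quad\text{and}\\
&\hat\Sigma_{\rm ML} = \frac{1}{n}
(Y\hat\Gamma_{\rm ML} - W\hat B_{\rm ML})^\top
(Y\hat\Gamma_{\rm ML} - W\hat B_{\rm ML}).
\end{aligned} \tag{12.85}$$


Solving these equations, which must of course be done numerically, yields the
FIML estimator.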


There are many numerical methods for obtaining FIML estimates. One of
them is to make use of the artificial regression


$$(\Psi^\top \otimes I_n)(y_\bullet - X_\bullet\beta_\bullet) =
(\Psi^\top \otimes I_n)X_\bullet(B, \Gamma)\,b + \text{residuals}, \tag{12.86}$$


where, as usual, $\Psi\Psi^\top = \Sigma^{-1}$. This is analogous to the
multivariate GNR (12.53). If we start from initial consistent estimates, this
artificial regression can be used to update the estimates of $B$ and $\Gamma$,
and equation (12.81) can be used to update the estimate of $\Sigma$. Like other
artificial regressions, (12.86) can also be used to compute test statistics and
covariance matrices.


Another approach is to concentrate the loglikelihood function with respect
to $\Sigma$. As readers are asked to show in Exercise 12.24, the concentrated
loglikelihood function can be written as


$$-\frac{gn}{2}(\log 2\pi + 1) + n\log|\det\Gamma|
- \frac{n}{2}\log\Bigl|\frac{1}{n}(Y\Gamma - WB)^\top(Y\Gamma - WB)\Bigr|, \tag{12.87}$$


which is the analog of (12.41) and (12.51). Expression (12.87) may be maximized
directly with respect to $B$ and $\Gamma$ to yield $\hat B_{\rm ML}$ and
$\hat\Gamma_{\rm ML}$. This approach may or may not be easier numerically than
solving equations (12.85).


The FIML estimator is not defined if the matrix
$(Y\Gamma - WB)^\top(Y\Gamma - WB)$ that appears in (12.87) does not have full
rank for all admissible values of $B$ and $\Gamma$, and this requires that
$n > g + k$. This result suggests that $n$ may have to be substantially greater
than $g + k$ if FIML is to have good finite-sample properties; see Sargan (1975)
and Brown (1981).
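To make the relationship between (12.80) and (12.87) concrete, the following sketch evaluates both functions at arbitrary values of $B$ and $\Gamma$ and confirms numerically that they coincide when $\Sigma$ is set to the matrix given by (12.81). The function names and the simulated data are hypothetical. The quadratic form in (12.80) is computed as the trace of $\Sigma^{-1}U^\top U$, with $U = Y\Gamma - WB$, which is an equivalent way of writing it.

```python
import numpy as np

def loglik_full(Y, W, Gamma, B, Sigma):
    """Loglikelihood (12.80), with the quadratic form written as a trace."""
    n, g = Y.shape
    U = Y @ Gamma - W @ B                           # structural residuals
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logabsdet_Gamma = np.linalg.slogdet(Gamma)   # log |det Gamma|
    quad = np.trace(np.linalg.solve(Sigma, U.T @ U))
    return (-0.5 * g * n * np.log(2 * np.pi) - 0.5 * n * logdet_Sigma
            + n * logabsdet_Gamma - 0.5 * quad)

def loglik_concentrated(Y, W, Gamma, B):
    """Concentrated loglikelihood (12.87)."""
    n, g = Y.shape
    U = Y @ Gamma - W @ B
    _, logdet = np.linalg.slogdet(U.T @ U / n)
    _, logabsdet_Gamma = np.linalg.slogdet(Gamma)
    return (-0.5 * g * n * (np.log(2 * np.pi) + 1)
            + n * logabsdet_Gamma - 0.5 * n * logdet)

# Hypothetical data and parameter values.
rng = np.random.default_rng(4)
n, g, l = 200, 2, 3
W = rng.normal(size=(n, l))
Gamma = np.array([[1.0, -0.5], [-0.8, 1.0]])
B = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0]])
Y = (W @ B + rng.normal(size=(n, g))) @ np.linalg.inv(Gamma)

U = Y @ Gamma - W @ B
Sigma_hat = U.T @ U / n                             # as in (12.81)
full_at_hat = loglik_full(Y, W, Gamma, B, Sigma_hat)
full_at_identity = loglik_full(Y, W, Gamma, B, np.eye(g))
concentrated = loglik_concentrated(Y, W, Gamma, B)
print(full_at_hat, concentrated, full_at_identity)
```

Passing `loglik_concentrated` to a numerical optimizer over the free elements of $B$ and $\Gamma$ would be one way to obtain FIML estimates, although we do not attempt that here.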


Comparison with Three-Stage Least Squares 


We remarked at the beginning of this section that the FIML estimator is not
in general equal to the continuously updated 3SLS estimator. In order to
study the relationship between the two estimators, we write out explicitly the
estimating equations for 3SLS and compare them with (12.85), the estimating
equations for FIML. Equations (12.58) and (12.17) imply that the continuously
updated version of the 3SLS estimator is defined by the equations


$$\begin{aligned}
&\hat X_\bullet^\top(\hat\Sigma_{\rm 3SLS}^{-1} \otimes I_n)
(y_\bullet - X_\bullet\hat\beta_\bullet^{\rm 3SLS}) = 0, \quad\text{and}\\
&\hat\Sigma_{\rm 3SLS} = \frac{1}{n}
(Y\hat\Gamma_{\rm 3SLS} - W\hat B_{\rm 3SLS})^\top
(Y\hat\Gamma_{\rm 3SLS} - W\hat B_{\rm 3SLS}).
\end{aligned} \tag{12.88}$$


The second of these equations has exactly the same form as the second equation
of (12.85). The first equation is also very similar to the first equation
of (12.85), but there is one difference. In (12.85), the leftmost matrix on
the left-hand side of the first equation is the transpose of
$X_\bullet(\hat B_{\rm ML}, \hat\Gamma_{\rm ML})$, of which the typical diagonal
block is $[\,Z_i \;\; Y_i(\hat B_{\rm ML}, \hat\Gamma_{\rm ML})\,]$. In contrast,
the corresponding matrix in the first equation of (12.88) is the transpose of
$\hat X_\bullet$, of which the typical diagonal block is, from (12.57),
$[\,Z_i \;\; P_W Y_i\,]$.


In both cases, the matrix is an estimate of the matrix of optimal instruments
for equation $i$, that is, the matrix of the expectations of the explanatory
variables conditional on all predetermined information. It is clear from the
RRF (12.70) that this matrix is $[\,Z_i \;\; Y_i(B, \Gamma)\,]$, where $B$ and
$\Gamma$ are the true parameters of the DGP. FIML uses the FIML estimates of $B$
and $\Gamma$ in place of the true values, while 3SLS estimates $Y_i(B, \Gamma)$ by
$P_W Y_i$, that is, by the fitted values from estimation of the unrestricted
reduced form (12.71). The latter will, in general, be less efficient than the
former.


If the restricted and unrestricted reduced forms are equivalent, as they will 
be if all the equations of the system are just identified, then the estimating 
equations (12.88) and (12.85) are also equivalent, and the 3SLS and FIML 
estimators must coincide. In this case, as we saw in the last section, 3SLS 
is also the same as 2SLS, that is, equation-by-equation IV estimation. Thus 
all the estimators we have considered are identical in the just-identified case. 
When there are overidentifying restrictions, and 3SLS is used without
continuous updating, then the 3SLS estimators of $B$ and $\Gamma$ are replaced by
the 2SLS ones in the second equation of (12.88). Solving this equation yields the
classical 3SLS estimator (12.78), which is evidently much easier to compute
than the FIML estimator.


Our treatment of the relationship between 3SLS and FIML has been quite
brief. For much fuller treatments, see Hausman (1975) and Hendry (1976).


Inference based on FIML Estimates 


Since the first equation of (12.85) is just an estimating equation for efficient
GMM, we can estimate the covariance matrix of $\hat\beta_\bullet^{\rm ML}$ by the
obvious estimate of $n^{-1}$ times the asymptotic covariance matrix (12.63),
namely,


$$\widehat{\rm Var}(\hat\beta_\bullet^{\rm ML}) =
\bigl(X_\bullet^\top(\hat B_{\rm ML}, \hat\Gamma_{\rm ML})
(\hat\Sigma_{\rm ML}^{-1} \otimes I_n)
X_\bullet(\hat B_{\rm ML}, \hat\Gamma_{\rm ML})\bigr)^{-1}. \tag{12.89}$$


Notice that, if we evaluate the artificial regression (12.86) at the ML estimates,
then $1/s^2$ times the OLS covariance matrix is equal to this matrix.


There are two differences between the estimated covariance matrix for FIML
given in equation (12.89) and the estimated covariance matrix for the classical
3SLS estimator given in equation (12.79). The first is that they use different
estimates of $\Sigma$. The second is that, in (12.89), the endogenous variables in
$X_\bullet$ are replaced by their fitted values, based on the FIML estimates,
while in (12.79) they are replaced by their projections on to $\mathcal{S}(W)$.


If the model (12.54) is correctly specified, and the error terms really do satisfy
the assumptions we have made about them, then each row $V_t$ of the matrix
of error terms $V$ in the URF (12.71) must have properties like those of the
structural error terms $U_t$ in (12.03). This implies that the error terms in
every equation of the URF must be homoskedastic and serially independent. This
suggests that the first step in testing the statistical assumptions on which
FIML estimation is based should always be to perform tests for
heteroskedasticity and serial correlation on the equations of the unrestricted
reduced form; suitable testing procedures were discussed in Sections 7.5 and 7.7.
If there is strong evidence that the $V_t$ are not IID, then either at least one
of the structural equations is misspecified, or we need to make more complicated
assumptions about the error terms.


It is also important to test any overidentifying restrictions. In the case of 
FIML, it is natural to use a likelihood ratio test rather than a Hansen-Sargan 
test, as we suggested for 3SLS and SUR estimation. The number of restrictions 
is, once again, gl — k, the difference between the number of coefficients in 
the URF and the number in the structural model. The restricted value of 
the loglikelihood function is the maximized value of either the loglikelihood 
function (12.80) or the concentrated loglikelihood function (12.87), and the 
unrestricted value is 


$$-\frac{gn}{2}(\log 2\pi + 1)
- \frac{n}{2}\log\Bigl|\frac{1}{n}(Y - W\hat\Pi)^\top(Y - W\hat\Pi)\Bigr|,$$


where $\hat\Pi$ denotes the matrix of OLS estimates of the parameters of the URF.
Twice the difference between the unrestricted and restricted values of the
loglikelihood function is asymptotically distributed as $\chi^2(gl - k)$ if the
model is correctly specified and the overidentifying restrictions are satisfied.


Limited-Information Maximum Likelihood


When a system of equations consists of just one structural equation, together
with one or more reduced-form equations, the FIML estimator of the structural
equation reduces to a single-equation estimator. We can write the single
structural equation as


$$y = Z\beta_1 + Y\beta_2 + u, \tag{12.90}$$


where we use a notation similar to that of (12.54), but without indices on the
variables and parameters. There are $k_1$ elements in $\beta_1$ and $k_2$ in
$\beta_2$, with $k = k_1 + k_2$. A complete simultaneous system can be formed by
combining (12.90) with the equations of the unrestricted reduced form for the
endogenous variables in the matrix $Y$. We write these equations as


$$Y = W\Pi + V = Z\Pi_1 + W_1\Pi_2 + V, \tag{12.91}$$


where the matrix $W_1$ contains all the predetermined instruments that are
excluded from the matrix $Z$.


Since the equations of the unrestricted reduced form are just identified by 
construction, the only equation of the system consisting of (12.90) and (12.91) 
that can be overidentified is (12.90) itself. If it is also just identified, then, as 
we have seen, 3SLS and FIML estimation both give exactly the same results 
as IV estimation of (12.90) by itself. If equation (12.90) is overidentified, then 
it turns out that 3SLS, without continuous updating, also gives the same 
estimates of the parameters of (12.90) as IV estimation. Readers are asked to 
prove this result in Exercise 12.27. However, continuously updated 3SLS and 
ML give different, and possibly better, estimates in this case. 


Maximum likelihood estimation of equation (12.90), implicitly treating it as
part of a system with (12.91), is called limited-information maximum
likelihood, or LIML. The terminology “limited-information” refers to the fact
that no use is made of any overidentifying or cross-equation restrictions that
may apply to the parameters of the matrix $\Pi$ of reduced-form coefficients.
Formally, LIML is FIML applied to a system in which only one equation is 
overidentified. However, as we will see, LIML is in fact a single-equation 
estimation method, in the same sense that 2SLS applied to (12.90) alone is 
a single-equation method. The calculations necessary to see this are rather 
complicated, and so here we will simply state the principal result, which dates 
back as far as Anderson and Rubin (1949). A derivation of this result may be 
found in Davidson and MacKinnon (1993, Chapter 18). 


The Anderson-Rubin result is that the LIML estimate of $\beta_2$ in equation
(12.90) is given by minimizing the ratio


$$\kappa = \frac{(y - Y\beta_2)^\top M_Z (y - Y\beta_2)}
{(y - Y\beta_2)^\top M_W (y - Y\beta_2)}, \tag{12.92}$$


where $M_Z$ projects off the predetermined variables included in (12.90), and
$M_W$ projects off all the instruments, both those in $Z$ and those in $W_1$. The
value $\hat\kappa$ that minimizes (12.92) may be found by a non-iterative
procedure that is discussed in the appendix. The maximized value of the
loglikelihood function is then


$$-\frac{gn}{2}\log(2\pi) - \frac{n}{2}\log(\hat\kappa)
- \frac{n}{2}\log|Y_*^\top M_W Y_*|, \tag{12.93}$$


where $Y_* = [\,y \;\; Y\,]$.


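If we write equation (12.90) as $y = X\beta + u$, then the LIML estimator of
$\beta$ is defined by the estimating equations


$$X^\top(I - \hat\kappa M_W)(y - X\hat\beta^{\rm LIML}) = 0, \tag{12.94}$$


which can be solved explicitly once $\hat\kappa$ has been computed. We find that


$$\hat\beta^{\rm LIML} = \bigl(X^\top(I - \hat\kappa M_W)X\bigr)^{-1}
X^\top(I - \hat\kappa M_W)\,y. \tag{12.95}$$


A suitable estimate of the covariance matrix of the LIML estimator is


$$\widehat{\rm Var}(\hat\beta^{\rm LIML}) =
\hat\sigma^2\bigl(X^\top(I - \hat\kappa M_W)X\bigr)^{-1}, \tag{12.96}$$


where


$$\hat\sigma^2 = \frac{1}{n}(y - X\hat\beta^{\rm LIML})^\top(y - X\hat\beta^{\rm LIML}).$$


Given (12.96), confidence intervals, asymptotic $t$ tests, and Wald tests can
readily be computed in the usual way.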
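The calculations in (12.92) through (12.96) can be sketched as follows. One standard non-iterative way to find $\hat\kappa$, which we assume here, is to compute it as the smallest eigenvalue of $(Y_*^\top M_W Y_*)^{-1} Y_*^\top M_Z Y_*$, where $Y_* = [\,y \;\; Y\,]$. The function names, the simulated data, and the parameter values are hypothetical, and the divisor $n$ in the estimate of $\sigma^2$ is also an assumption of this sketch.

```python
import numpy as np

def resid_maker(W, A):
    """M_W A: residuals from regressing the columns of A on W."""
    return A - W @ np.linalg.lstsq(W, A, rcond=None)[0]

def liml(y, Z, Y, W):
    """LIML estimates of (12.90) via (12.92), (12.95), and (12.96)."""
    n = len(y)
    Ystar = np.column_stack([y, Y])
    A = Ystar.T @ resid_maker(Z, Ystar)       # Y*' M_Z Y*
    Bm = Ystar.T @ resid_maker(W, Ystar)      # Y*' M_W Y*
    kappa = np.linalg.eigvals(np.linalg.solve(Bm, A)).real.min()
    X = np.column_stack([Z, Y])
    MX = resid_maker(W, X)
    XAX = X.T @ X - kappa * (X.T @ MX)        # X'(I - kappa M_W) X
    XAy = X.T @ y - kappa * (MX.T @ y)        # X'(I - kappa M_W) y
    beta = np.linalg.solve(XAX, XAy)
    e = y - X @ beta
    cov = (e @ e / n) * np.linalg.inv(XAX)    # as in (12.96)
    return beta, kappa, cov

# Hypothetical design: one endogenous regressor, three excluded instruments.
rng = np.random.default_rng(2)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
W1 = rng.normal(size=(n, 3))
W = np.column_stack([Z, W1])
E = rng.normal(size=(n, 2))
u = E[:, 0]
v = 0.6 * E[:, 0] + 0.8 * E[:, 1]             # correlated with u
Yend = Z @ np.array([0.5, 0.5]) + W1 @ np.array([1.0, 1.0, 1.0]) + v
y = Z @ np.array([1.0, -0.5]) + 0.7 * Yend + u

beta, kappa, cov = liml(y, Z, Yend[:, None], W)

# The ratio (12.92), evaluated at the LIML estimate of beta_2.
e2 = y - Yend * beta[2]
ratio = (e2 @ resid_maker(Z, e2)) / (e2 @ resid_maker(W, e2))
print(beta, kappa, ratio)
```

Evaluating the ratio (12.92) at the estimate of $\beta_2$ obtained from (12.95) should reproduce $\hat\kappa$ up to rounding error, which provides a simple check on the computations.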


Since $W = [\,Z \;\; W_1\,]$ is the matrix containing all the instruments, we can
decompose $M_W$ as $M_Z - P_{M_Z W_1}$. This makes it clear that
$\hat\kappa \ge 1$, since the numerator of (12.92) cannot be smaller than the
denominator. If equation (12.90) is just identified, then, by the order
condition, $Y$ and $W_1$ have the same number of columns. In this case, it can be
shown that the minimized value of $\kappa$ is actually equal to 1; see
Exercise 12.28.


In the context of 2SLS estimation, we saw in Section 8.6 that the Hansen-Sargan
test can be used to test overidentifying restrictions. In the case of
LIML estimation, it is easier to test these restrictions by a likelihood ratio test.
As shown in Exercise 12.28, the maximized loglikelihood of the unconstrained
model for which the overidentifying restrictions of (12.90) are relaxed is the
same as expression (12.93) for the constrained model, but with $\hat\kappa = 1$.
Thus the LR statistic for testing the overidentifying restrictions, which is twice
the difference between the unconstrained and constrained maxima, is simply
equal to $n\log\hat\kappa$. This test statistic was first proposed by Anderson and
Rubin (1950). Since there are $l - k$ overidentifying restrictions, the LR
statistic is asymptotically distributed as $\chi^2(l - k)$.


K-class Estimators


In equation (12.94), we have written the LIML estimating equations in the
form of the estimating equations for a K-class estimator, following Theil
(1961). The K-class is the set of estimators defined by the estimating equations
(12.94) with an arbitrary scalar $K$ replacing $\hat\kappa$. The LIML estimator is
thus a K-class estimator with $K = \hat\kappa$. Similarly, the 2SLS estimator
(12.60) is a K-class estimator with $K = 1$, and the OLS estimator is a K-class
estimator with $K = 0$.
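The special cases $K = 0$ and $K = 1$ are easy to confirm numerically. The sketch below defines a hypothetical function `k_class` that solves the estimating equations (12.94) for an arbitrary scalar $K$, and compares its output with OLS and 2SLS estimates computed directly. The simulated data are purely illustrative.

```python
import numpy as np

def k_class(y, X, W, K):
    """Solve X'(I - K M_W)(y - X b) = 0 for b, for a given scalar K."""
    MX = X - W @ np.linalg.lstsq(W, X, rcond=None)[0]   # M_W X
    lhs = X.T @ X - K * (X.T @ MX)
    rhs = X.T @ y - K * (MX.T @ y)
    return np.linalg.solve(lhs, rhs)

# Hypothetical data with one endogenous regressor.
rng = np.random.default_rng(5)
n = 400
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
W = np.column_stack([Z, rng.normal(size=(n, 3))])
E = rng.normal(size=(n, 2))
u, v = E[:, 0], 0.6 * E[:, 0] + 0.8 * E[:, 1]
Yend = W @ np.array([0.5, 0.5, 1.0, 1.0, 1.0]) + v
X = np.column_stack([Z, Yend])
y = X @ np.array([1.0, -0.5, 0.7]) + u

# Direct OLS and 2SLS estimates for comparison.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
PX = W @ np.linalg.lstsq(W, X, rcond=None)[0]           # P_W X
b_2sls = np.linalg.solve(PX.T @ X, PX.T @ y)

b_k0 = k_class(y, X, W, 0.0)
b_k1 = k_class(y, X, W, 1.0)
print(b_k0 - b_ols, b_k1 - b_2sls)
```

With $K = 1$, the matrix $I - K M_W$ reduces to $P_W$, which is why the second comparison reproduces 2SLS.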


Numerous other K-class estimators have been proposed. It can be shown that,
under standard regularity conditions, these estimators are consistent whenever
the plim of $K$ is 1. Thus 2SLS is consistent, and OLS is inconsistent. Since
$n\log\hat\kappa$ is asymptotically distributed as $\chi^2(l - k)$ when the
overidentifying restrictions are satisfied, it must be the case that
${\rm plim}\,\log\hat\kappa = 0$, which implies that ${\rm plim}\,\hat\kappa = 1$.
It follows that LIML is asymptotically equivalent to 2SLS. In finite samples,
however, the properties of LIML may be quite different from those of 2SLS. The
strangest feature of the LIML estimator is that it has no finite moments. This
implies that its density tends to have very thick tails, as readers are asked to
illustrate in Exercise 12.32. However, if we measure bias by comparing the median
of the estimator with the true value, the LIML estimator is generally much less
biased than the 2SLS estimator.


Fuller (1977) has proposed a modified LIML estimator that sets $K$ equal to
$\hat\kappa - \alpha/(n - k)$, where $\alpha$ is a positive constant that must be
chosen by the investigator. One good choice is $\alpha = 1$, since it yields
estimates that are approximately unbiased. In contrast to the LIML estimator,
which has no finite moments, Fuller’s modified estimator has all moments finite
provided the sample size is large enough. Mariano (2001) provides a recent
summary of the finite-sample properties of LIML, 2SLS, and other K-class
estimators.


Invariance of ML Estimators 


One important feature of the FIML and LIML estimators is that they are 
invariant to any reparametrization of the model. This is actually a general 
property of all ML estimators, which was explored in Exercise 10.14. Since 
simultaneous equations systems can be parametrized in many different ways, 
this is a useful property for these estimators to have. It means that two 
investigators using the same data set will obtain the same estimates even if 
they employ different parametrizations. 


As an example, consider the two-equation demand-supply model that was first
discussed in Section 8.2:


$$q_t = \gamma_d\, p_t + X_t^d \beta_d + u_t^d \tag{12.97}$$
$$q_t = \gamma_s\, p_t + X_t^s \beta_s + u_t^s. \tag{12.98}$$


As the notation indicates, equation (12.97) is a demand function, and equation
(12.98) is a supply function. In this system, $p_t$ and $q_t$ denote the price and
quantity of some commodity in period $t$, which may well be in logarithms,
$X_t^d$ and $X_t^s$ are row vectors of exogenous or predetermined variables,
$\beta_d$ and $\beta_s$ are the corresponding vectors of parameters, and
$\gamma_d$ and $\gamma_s$ are the slopes of the demand and supply functions, which
can be interpreted as elasticities if $p_t$ and $q_t$ are in logarithms.


Now suppose that we reparametrize the supply function as


$$p_t = \gamma_s'\, q_t + X_t^s \beta_s' + u_t^{\prime s}, \tag{12.99}$$


where $\gamma_s' = 1/\gamma_s$ and $\beta_s' = -\beta_s/\gamma_s$. The invariance
property of maximum likelihood implies that, if we first use FIML to estimate the
system consisting of equations (12.97) and (12.98) and then use it to estimate the
system consisting of equations (12.97) and (12.99), we will obtain exactly the
same estimates of the parameters of equation (12.97). Moreover, the estimated
parameters of equations (12.98) and (12.99) will bear precisely the same
relationship as the true parameters. That is,


$$\hat\gamma_s' = 1/\hat\gamma_s \quad\text{and}\quad
\hat\beta_s' = -\hat\beta_s/\hat\gamma_s. \tag{12.100}$$


If we use LIML to estimate equations (12.98) and (12.99), the two sets of
LIML estimates will likewise satisfy conditions (12.100).


The invariance property of LIML and FIML is not shared by 2SLS, 3SLS, or 
any other GMM estimator. If, for example, we use 3SLS to estimate the two 
versions of this system of equations, the two sets of estimates will not satisfy 
conditions (12.100); see Exercise 12.31. 
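The contrast between LIML and 2SLS under renormalization can be illustrated with simulated data. The sketch below generates a hypothetical demand-supply system, estimates the supply function in both of its normalizations by LIML and by 2SLS, and checks conditions (12.100). All variable names and parameter values are assumptions made for the illustration, and $\hat\kappa$ is computed as the smallest eigenvalue of $(Y_*^\top M_W Y_*)^{-1} Y_*^\top M_Z Y_*$. The supply function is overidentified here, because two demand shifters are excluded from it.

```python
import numpy as np

def M(W, A):
    """M_W A: residuals from regressing the columns of A on W."""
    return A - W @ np.linalg.lstsq(W, A, rcond=None)[0]

def k_class(y, X, W, K):
    MX = M(W, X)
    return np.linalg.solve(X.T @ X - K * (X.T @ MX), X.T @ y - K * (MX.T @ y))

def liml_kappa(y, Yend, Z, W):
    Ys = np.column_stack([y, Yend])
    return np.linalg.eigvals(
        np.linalg.solve(Ys.T @ M(W, Ys), Ys.T @ M(Z, Ys))).real.min()

# Hypothetical demand-supply system.
rng = np.random.default_rng(3)
n = 300
const = np.ones(n)
income, taste, cost = rng.normal(size=(3, n))
ud, us = rng.normal(size=(2, n))
gamma_d, gamma_s = -1.0, 0.8
Xd_bd = 2.0 * const + 1.0 * income + 0.5 * taste   # X^d beta_d
Xs_bs = 1.0 * const - 1.0 * cost                   # X^s beta_s
p = (Xs_bs - Xd_bd + us - ud) / (gamma_d - gamma_s)
q = gamma_s * p + Xs_bs + us

Z = np.column_stack([const, cost])                 # included in supply
W = np.column_stack([const, cost, income, taste])  # all instruments
Xq = np.column_stack([Z, p])                       # regressors of (12.98)
Xp = np.column_stack([Z, q])                       # regressors of (12.99)

# LIML in both normalizations.
kq, kp = liml_kappa(q, p, Z, W), liml_kappa(p, q, Z, W)
bq, bp = k_class(q, Xq, W, kq), k_class(p, Xp, W, kp)

# 2SLS (K = 1) in both normalizations.
tq, tp = k_class(q, Xq, W, 1.0), k_class(p, Xp, W, 1.0)

print(bp[2] - 1.0 / bq[2])     # LIML: zero up to rounding error
print(tp[2] - 1.0 / tq[2])     # 2SLS: generally not zero
```

The value of $\hat\kappa$ is the same for both normalizations, because the ratio (12.92) depends only on the direction of the vector $[\,1 \;\; {-\beta_2^\top}\,]$ and not on how it is scaled.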


12.6 Nonlinear Simultaneous Equations Models 


As we saw in Section 12.3, it is fairly straightforward to extend the SUR 
model so as to allow for the possibility of nonlinearity. However, additional 
complications can arise with nonlinear simultaneous equations models. With 
an SUR system, the right-hand sides of the several regressions do not depend 
on current endogenous variables, but this is not true of a simultaneous system. 
If endogenous variables enter nonlinearly in such a system, then, since it is 
not always possible to find solutions to nonlinear equations in closed form, it 
may be infeasible to set up a reduced form in which each endogenous variable 
is expressed as a function only of predetermined variables and parameters. 


Feasible Efficient GMM 


The easiest way to take account of all interesting cases is to work in terms of 
zero functions and treat the nonlinear simultaneous system by the methods 
we developed in Section 9.5 for nonlinear GMM. The main extension needed 
for a simultaneous system is just that each elementary zero function depends, 
in general, on a vector of endogenous variables, rather than on just one.


Suppose that there are $g$ equations that, for each observation, simultaneously
determine $g$ endogenous variables, and suppose further that these equations
can be written as


$$f_{ti}(Y_t, \theta) = u_{ti}, \quad t = 1, \ldots, n, \quad i = 1, \ldots, g.$$


The functions $f_{ti}(\cdot)$ depend implicitly on predetermined explanatory
variables. They are, in general, nonlinear functions of both the $1 \times g$
vector $Y_t$ that contains the endogenous variables for observation $t$ and the
$k$-vector $\theta$ of model parameters. The $u_{ti}$ are error terms with mean
zero. In some cases, we may be ready to assume that the $u_{ti}$ satisfy the
conditions (12.02) that we have imposed on the other models considered in this
chapter.


It is clear that the $f_{ti}$ are elementary zero functions. We may stack them 
in the way we stacked the dependent variables of an SUR system. First, we 
define the n-vectors $f_i(Y, \theta)$, $i = 1, \ldots, g$, so that the $t^{\rm th}$ element of $f_i(Y, \theta)$ 
is $f_{ti}(Y_t, \theta)$, where $Y$ is the $n \times g$ matrix of which the $t^{\rm th}$ row is $Y_t$. Then 
we stack the $f_i$ vertically to construct the $gn \times 1$ vector $f_\bullet(Y, \theta)$. Under 
assumptions (12.02), the covariance matrix of this stacked vector is $\Sigma \otimes I_n$. 


According to the theory developed in Section 9.5, the optimal instruments 
for efficient GMM are given in terms of the matrix $\bar F(\theta)$ defined in equation 
(9.85). If, as before, we define the $g \times g$ matrix $\Psi$ such that $\Psi\Psi^\top = \Sigma^{-1}$, then 
the matrix $\Psi$ of (9.85) becomes $\Psi \otimes I_n$ in the present case. The matrix $F(\theta)$ 
of that equation becomes a $gn \times k$ matrix $F_\bullet(Y, \theta)$, of which the $ti^{\rm th}$ element 
is the derivative of the $t^{\rm th}$ element of $f_\bullet(Y, \theta)$ with respect to $\theta_i$, the $i^{\rm th}$ 
element of $\theta$. Under assumptions (12.02), the matrix $\bar F_\bullet$ needed for the optimal 
estimating equations is just the $gn \times k$ matrix of which the $t^{\rm th}$ row is the 
expectation of the $t^{\rm th}$ row of $F_\bullet$ conditional on all information predetermined 
at time t. The estimating equations we need correspond to equations (9.82). 
However, as discussed in the paragraph following (9.82), we must use $\bar F_\bullet(\theta)$ 
instead of $F_\bullet(\theta)$ in formulating the optimal instruments. We obtain 


$$\bar F_\bullet^\top(\theta)(\Sigma^{-1} \otimes I_n) f_\bullet(Y, \theta) = 0. \tag{12.101}$$


Although the notation differs slightly, the only important difference between 
(9.82) and (12.101) is that the latter equations involve $\bar F_\bullet(\theta)$ instead of $F_\bullet(\theta)$. 
There is also no factor of $n^{-1}$ in (12.101), an omission that evidently has no 
effect on the solution. 


It is precisely in the construction of the matrix $\bar F_\bullet$ that difficulties may arise. 
Since there may be no analytical expression for some or all of the endogenous 
variables, there may be no direct way of computing or even estimating $\bar F_\bullet$. 
In that case, we may proceed as in Section 9.5 by selecting a set of $l \ge k$ 
instruments, which we group into the $n \times l$ matrix $W$. We then replace the 
estimating equations (12.101) by 


$$F_\bullet^\top(Y, \theta)(\Sigma^{-1} \otimes P_W) f_\bullet(Y, \theta) = 0, \tag{12.102}$$


Copyright © 1999, Russell Davidson and James G. MacKinnon 


which closely resemble equations (12.60) for the linear case. Equivalently, we 
may minimize the criterion function 


$$f_\bullet^\top(Y, \theta)(\Sigma^{-1} \otimes P_W) f_\bullet(Y, \theta), \tag{12.103}$$


which is comparable to expression (12.61) for the linear case. The first-order 
conditions for minimizing (12.103) with respect to $\theta$ are equivalent to the 
estimating equations (12.102). 
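
As an illustration, the criterion function (12.103) can be evaluated without ever forming the $gn \times gn$ Kronecker product, because the quadratic form equals the trace of $\Sigma^{-1}F^\top P_W F$, where $F$ is the $n \times g$ matrix with columns $f_i$. The sketch below assumes NumPy; the function name `gmm_criterion` and its arguments are our own illustrative choices, not notation from the text.

```python
import numpy as np

def gmm_criterion(F, W, Sigma):
    """Evaluate f'(Sigma^{-1} kron P_W) f, where f stacks the columns of F.

    F     : n x g array whose i-th column is the vector of zero functions f_i
    W     : n x l array of instruments
    Sigma : g x g contemporaneous covariance matrix
    """
    # P_W F: fitted values from regressing each column of F on W
    coef, *_ = np.linalg.lstsq(W, F, rcond=None)
    PF = W @ coef
    Sigma_inv = np.linalg.inv(Sigma)
    # sum over i, j of sigma^{ij} f_i' P_W f_j = trace(Sigma^{-1} F' P_W F)
    return float(np.trace(Sigma_inv @ (F.T @ PF)))
```

In practice one would pass this function, with $F$ recomputed for each trial value of $\theta$, to a numerical minimizer.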


If, as will usually be the case, the matrix $\Sigma$ is not known, then we must 
first obtain preliminary consistent estimates, say $\acute\theta$. We might do this by 
solving the estimating equations (12.102) or minimizing the criterion function 
(12.103) with $\Sigma$ replaced by an identity matrix. Alternatively, if cross-equation 
restrictions are not needed for identification, we might estimate each 
equation separately by the methods of Section 9.5. We can then use these 
preliminary estimates to form an estimate of $\Sigma$ by the formula 


$$\acute\Sigma = \frac{1}{n} \begin{bmatrix} f_1^\top(Y, \acute\theta) \\ \vdots \\ f_g^\top(Y, \acute\theta) \end{bmatrix} \begin{bmatrix} f_1(Y, \acute\theta) & \cdots & f_g(Y, \acute\theta) \end{bmatrix}.$$


This estimate can then be used in either (12.102) or (12.103) to obtain more 
efficient estimates. We can either stop after one round or iterate to obtain 
continuously updated estimates. 
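
The estimate of $\Sigma$ used in the second round is simply the matrix of averaged cross products of the preliminary zero functions, with $ij^{\rm th}$ element $n^{-1}f_i^\top f_j$. A minimal NumPy sketch follows; the name `estimate_sigma` is hypothetical, and the input is assumed to be the $n \times g$ matrix whose columns are the $f_i$ evaluated at the preliminary estimates.

```python
import numpy as np

def estimate_sigma(F_prelim):
    """Return F'F / n, the estimate of Sigma from preliminary zero functions.

    F_prelim : n x g array whose i-th column is f_i evaluated at the
               preliminary consistent estimates of theta.
    """
    F_prelim = np.asarray(F_prelim, dtype=float)
    n = F_prelim.shape[0]
    return F_prelim.T @ F_prelim / n
```

Iterating amounts to alternating between this step and a new minimization of the criterion function until the estimates stop changing.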


The one-round procedure yields a generalization of the nonlinear instrumental 
variables, or NLIV, estimator $\hat\theta_{\rm NLIV}$, which we first encountered in Section 8.9. 
It was originally proposed by Jorgenson and Laffont (1974). In Exercise 12.33, 
readers are asked to write down the first-order conditions that define the 
estimator $\hat\theta_{\rm NLIV}$, along with the usual estimate of its covariance matrix. 


The NLIV estimator is sometimes called nonlinear three-stage least squares, 
or NL3SLS. We prefer not to do so, because that name is quite misleading. 
For the reasons discussed in Section 8.9 in connection with nonlinear two- 
stage least squares, we never actually replace endogenous variables by their 
fitted values from reduced-form regressions. Moreover, there are really just 
two stages, the first in which preliminary consistent estimates are obtained, 
the second in which (12.102) or (12.103) is used with the estimated $\Sigma$. 


Nonlinear FIML Estimation 


The other full-system estimation method that is widely used is nonlinear 
FIML. In order to derive the loglikelihood function, it is convenient to stack 
the vectors $f_i(Y, \theta)$ horizontally. Let $h_t(Y_t, \theta)$ be a $1 \times g$ row vector containing 
the elements $f_{t1}, \ldots, f_{tg}$. Then the model to be estimated can be written as 


$$h_t(Y_t, \theta) = U_t, \quad U_t \sim {\rm NID}(0, \Sigma). \tag{12.104}$$


The row vector $U_t$ contains the error terms $u_{ti}$, $i = 1, \ldots, g$, which are now 
assumed to be multivariate normal. In order to obtain the density of $Y_t$, 
we start from the density of $U_t$, replace $U_t$ by $h_t(Y_t, \theta)$, and multiply by 
the Jacobian factor $|\det J_t|$, where $J_t = \partial h_t(\theta)/\partial Y_t$ is the $g \times g$ matrix of 
derivatives of $h_t$ with respect to the elements of $Y_t$. The result is 


$$(2\pi)^{-g/2}\, |\det J_t|\, |\Sigma|^{-1/2} \exp\bigl(-\tfrac{1}{2} h_t(Y_t, \theta) \Sigma^{-1} h_t^\top(Y_t, \theta)\bigr).$$


Taking the logarithm of this, summing it over all observations, and then 
concentrating the result with respect to $\Sigma$, yields the concentrated loglikelihood 
function for the model (12.104): 


$$-\frac{gn}{2}(\log 2\pi + 1) + \sum_{t=1}^{n} \log|\det J_t| - \frac{n}{2} \log\Bigl|\frac{1}{n} \sum_{t=1}^{n} h_t^\top(Y_t, \theta) h_t(Y_t, \theta)\Bigr|.$$


The main difference between this function and its counterpart for the linear 
case, expression (12.87), is that the Jacobian matrices $J_t$ are in general 
different for each observation. Evaluating all these determinants could well be 
expensive when n is large and g is not very small. 
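
To make the cost of the Jacobian terms concrete, here is a minimal NumPy sketch of the concentrated loglikelihood for given values of the $h_t$ and $J_t$. The function name `concentrated_loglik` and the array layout are assumptions made for illustration; in an actual application both inputs would be recomputed at every trial value of $\theta$.

```python
import numpy as np

def concentrated_loglik(H, J):
    """Concentrated loglikelihood of the nonlinear simultaneous system.

    H : n x g array whose t-th row is h_t(Y_t, theta)
    J : n x g x g array whose t-th slice is the Jacobian dh_t / dY_t
    """
    n, g = H.shape
    # log|det J_t| for all n observations; slogdet works on stacked matrices
    _, log_abs_det_J = np.linalg.slogdet(J)
    # log determinant of the g x g matrix (1/n) sum_t h_t' h_t
    _, log_det_S = np.linalg.slogdet(H.T @ H / n)
    return (-0.5 * g * n * (np.log(2.0 * np.pi) + 1.0)
            + log_abs_det_J.sum()
            - 0.5 * n * log_det_S)
```

The n separate $g \times g$ determinants are the expensive part; the last term requires only one.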


Another difference between the linear and nonlinear cases is that, in the lat- 
ter, FIML and NLIV are not even asymptotically equivalent in general. In 
fact, if the error terms are not normally distributed, the FIML estimator may 
actually be inconsistent; see Phillips (1982). If the errors are indeed normal, 
then, for the usual reasons, the FIML estimator is more efficient asymptot- 
ically, although its efficiency may come at a price in terms of computational 
complexity. More detailed treatments of nonlinear FIML estimation may be 
found in Amemiya (1985, Chapter 8) and Gallant (1987, Chapter 6). 


12.7 Final Remarks 


Notation is a bugbear with multivariate regression models. These models 
can be written in many equivalent ways, and notation that is well suited to 
one estimation method may not be convenient for another. Once the nota- 
tional hurdle has been crossed, we have seen that it is not excessively difficult 
to estimate multivariate regression models, including simultaneous equations 
models, using a variety of familiar techniques. All the procedures we have 
discussed use some combination of (feasible) generalized least squares, in- 
strumental variables, GMM, and maximum likelihood. Except in the case of 
nonlinear simultaneous equations models, there is always a technique based on 
feasible GLS and/or instrumental variables that is asymptotically equivalent 
to maximum likelihood. 


12.8 Appendix: Detailed Results on FIML and LIML 


This appendix derives several results on FIML and LIML estimation that were 
too technical to include in the main text. 


First-Order Conditions for FIML 


For the purpose of obtaining the first-order conditions (12.83), it is convenient 
to write the loglikelihood function (12.80) in terms of the restricted reduced 
form (12.70). In the RRF, the $y_i$ are stacked horizontally. However, if we 
are to use the same approach as for the SUR model, we must stack them 
vertically. The $i^{\rm th}$ column of (12.70) can be written as 


$$y_i = WB\gamma^i + v_i, \tag{12.105}$$


where the g-vector $\gamma^i$ is the $i^{\rm th}$ column of $\Gamma^{-1}$, and $v_i$ is the $i^{\rm th}$ column of $V$. 
Then equations (12.105) can be written as 


$$\begin{aligned} y_\bullet &= (I_g \otimes WB)\gamma^\bullet + v_\bullet \\ &= (I_g \otimes W)\pi_\bullet + v_\bullet \\ &= W_\bullet \pi_\bullet + v_\bullet. \end{aligned} \tag{12.106}$$


Here the $g^2$-vector $\gamma^\bullet$ contains the $\gamma^i$ stacked vertically, the gn-vector $v_\bullet$ 
contains the $v_i$ stacked vertically, the $gn \times gl$ matrix $W_\bullet$ denotes $I_g \otimes W$, and 
the gl-vector $\pi_\bullet$ contains the $\pi_i$ stacked vertically. The $\pi_i$ are the columns 
of the matrix $\Pi$, defined here as $B\Gamma^{-1}$, as in the restricted reduced form. 


By rewriting the last equation in (12.106) so that $v_\bullet$ is a function of $y_\bullet$, we 
obtain the transformation that gives $v_\bullet$ in terms of $y_\bullet$. Exactly as with the 
transformation (12.31), the determinant of the Jacobian of this transformation 
is 1. Thus, in order to obtain the joint density of $y_\bullet$, we simply have to find 
the density of the vector $v_\bullet$ and then replace $v_\bullet$ by $y_\bullet - W_\bullet \pi_\bullet$. 


Since we have assumed that $v_\bullet$ is multivariate normal, and we know that its 
expectation is a zero vector, the only thing we need to write down its density 
is its covariance matrix. Recall that $V = U\Gamma^{-1}$, where $U$ is the matrix of 
structural form errors. Thus 


$$v_i = U\gamma^i = \sum_{j=1}^{g} u_j \gamma^{ji},$$


where $\gamma^{ji}$ is the $ji^{\rm th}$ element of $\Gamma^{-1}$. By stacking these equations vertically, 
it is not hard to see that 


$$v_\bullet = \bigl((\Gamma^\top)^{-1} \otimes I_n\bigr) u_\bullet.$$


Since the covariance matrix of $u_\bullet$ is assumed to be $\Sigma \otimes I_n$, it follows that the 
covariance matrix of $v_\bullet$ can be written as 


$$\begin{aligned} {\rm Var}(v_\bullet) = {\rm E}(v_\bullet v_\bullet^\top) &= \bigl((\Gamma^\top)^{-1} \otimes I_n\bigr)(\Sigma \otimes I_n)(\Gamma^{-1} \otimes I_n) \\ &= (\Gamma^\top)^{-1} \Sigma \Gamma^{-1} \otimes I_n. \end{aligned}$$


For some of the following calculations, it will be convenient to denote the 
matrix $(\Gamma^\top)^{-1} \Sigma \Gamma^{-1}$ by $\Omega$. 
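
The Kronecker-product manipulation just used is easy to confirm numerically. The sketch below, which assumes NumPy and uses arbitrary illustrative values for $\Gamma$, $\Sigma$, and n, checks that the triple product collapses to $\Omega \otimes I_n$ and that the determinant of $\Omega \otimes I_n$ is the determinant of $\Omega$ raised to the power n.

```python
import numpy as np

# Arbitrary illustrative values; any nonsingular Gamma and positive
# definite Sigma would serve equally well.
Gamma = np.array([[1.0, 0.4], [-0.3, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
n = 3

Gamma_inv = np.linalg.inv(Gamma)
Omega = Gamma_inv.T @ Sigma @ Gamma_inv
I_n = np.eye(n)

# Left-hand side: the product of three Kronecker products
lhs = np.kron(Gamma_inv.T, I_n) @ np.kron(Sigma, I_n) @ np.kron(Gamma_inv, I_n)
# Right-hand side: a single Kronecker product with Omega
rhs = np.kron(Omega, I_n)

same_matrix = np.allclose(lhs, rhs)
same_det = np.isclose(np.linalg.det(rhs), np.linalg.det(Omega) ** n)
```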


Using this notation, the density of $y_\bullet$ is $(2\pi)^{-gn/2}$ times 

$$|\Omega \otimes I_n|^{-1/2} \exp\bigl(-\tfrac{1}{2}(y_\bullet - W_\bullet \pi_\bullet)^\top (\Omega^{-1} \otimes I_n)(y_\bullet - W_\bullet \pi_\bullet)\bigr).$$


This may be compared with (12.32), the analogous expression for a linear SUR 
system. It follows that the loglikelihood function for the linear simultaneous 
equations model can be written as 


$$-\frac{gn}{2} \log 2\pi - \frac{n}{2} \log|\Omega| - \frac{1}{2}(y_\bullet - W_\bullet \pi_\bullet)^\top (\Omega^{-1} \otimes I_n)(y_\bullet - W_\bullet \pi_\bullet). \tag{12.107}$$


This expression is deceptively simple, because the vector $\pi_\bullet$ depends in a 
complicated way on the vector of structural parameters $\beta_\bullet$. However, since 
(12.107) depends on $\Omega$ in precisely the same way in which expression (12.33), 
the loglikelihood function for a linear SUR system, depends on $\Sigma$, the ML 
estimator of $\Omega$ must have exactly the same form as (12.36). 


It is of interest to compare the loglikelihood functions (12.107) and (12.80). 
A little algebra, which is detailed in Exercise 12.23, shows that 


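$$(\Gamma^\top \otimes I_n)(y_\bullet - W_\bullet \pi_\bullet) = y_\bullet - X_\bullet \beta_\bullet, \tag{12.108}$$


which is the vector of residuals from the structural form expressed as in (12.55) 
in stacked form. Thus the quadratic form that appears in (12.107) can also 
be written as 

$$(y_\bullet - X_\bullet \beta_\bullet)^\top (\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet \beta_\bullet). \tag{12.109}$$


Now consider the second term in (12.107). By the definition of $\Omega$ and the 
properties of determinants, this term is 


$$-\frac{n}{2} \log|\Omega| = -\frac{n}{2} \log\bigl(|\det \Gamma|^{-2} |\Sigma|\bigr) = n \log|\det \Gamma| - \frac{n}{2} \log|\Sigma|. \tag{12.110}$$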


If we start with (12.107) and replace the quadratic form by expression (12.109) 
and the second term by the rightmost expression in (12.110), we obtain the 
loglikelihood function (12.80). Thus we see that these two ways of writing the 
loglikelihood function are indeed equivalent. 


In order to write down the ML estimator of $\Omega$, we define the $n \times g$ matrix 
$V(\beta_\bullet)$ to have $i^{\rm th}$ column $y_i - WB\gamma^i$, which is just the $i^{\rm th}$ block of the 
vector $y_\bullet - W_\bullet \pi_\bullet$. It follows that $V(\beta_\bullet) = Y - WB\Gamma^{-1}$. When evaluated 
at the ML estimator $\hat\beta_\bullet^{\rm ML}$, this is just the ML estimator of the errors of the 
RRF (12.70). By analogy with (12.36), we find that 


$$\hat\Omega_{\rm ML} = \frac{1}{n} V^\top(\hat\beta_\bullet^{\rm ML})\, V(\hat\beta_\bullet^{\rm ML}).$$


We are entitled to write $V$ as a function of $\beta_\bullet$ here because, as we saw when 
defining the RRF, the matrices $B$ and $\Gamma$ on which (12.107) depends through 
the vector $W_\bullet \pi_\bullet$ are uniquely determined by the structural parameters in the 
vector $\beta_\bullet$. Conversely, if we obtain ML estimators of the matrices $B$ and $\Gamma$, 
these uniquely determine the ML estimator of $\beta_\bullet$. 


Only the last term of the loglikelihood function (12.107) depends on $B$ and $\Gamma$. 
Therefore, conditional on $\Omega$, the maximization of (12.107) reduces as usual 
to the minimization of a quadratic form, which in this case is 


$$(y_\bullet - W_\bullet \pi_\bullet)^\top (\Omega^{-1} \otimes I_n)(y_\bullet - W_\bullet \pi_\bullet). \tag{12.111}$$


From the definition of $\Omega$ and the properties (12.08) of Kronecker products, 
we observe that $\Omega^{-1} \otimes I_n = (\Gamma \otimes I_n)(\Sigma^{-1} \otimes I_n)(\Gamma^\top \otimes I_n)$. 


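From the first equation in (12.106), we can see that the quadratic form 
(12.111) can also be written as 


$$\bigl(y_\bullet - (I_g \otimes WB)\gamma^\bullet\bigr)^\top (\Omega^{-1} \otimes I_n)\bigl(y_\bullet - (I_g \otimes WB)\gamma^\bullet\bigr).$$


From this expression, we see that the partial derivatives of (12.107) with 
respect to the $g^2$ elements of $\gamma^\bullet$ are the $g^2$ elements of the vector 


$$(I_g \otimes B^\top W^\top)(\Omega^{-1} \otimes I_n)\bigl(y_\bullet - (I_g \otimes WB)\gamma^\bullet\bigr). \tag{12.112}$$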


The conditions we seek are not given by simply equating the elements of this 
vector to zero, because many elements of the matrix $\Gamma$ are restricted to be 
equal to 0 or 1. The restrictions translate into complicated conditions on the 
elements of $\gamma^\bullet$ which, fortunately, we need not concern ourselves with. Rather, 
we compute the derivatives of $\gamma^\bullet$ with respect to any element $\gamma_{ij}$ of $\Gamma$ which is 
not restricted, and then use the chain rule to obtain the derivative of (12.107) 
with respect to $\gamma_{ij}$. We can then quite properly equate the resulting derivative 
to zero in order to obtain a first-order condition. 


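The vectors that are stacked in $\gamma^\bullet$ are the columns of $\Gamma^{-1}$, and it is therefore 
not hard to see that $(\Gamma^\top \otimes I_g)\gamma^\bullet$ is a vector of $g^2$ components that are all 
either 0 or 1, and thus independent of the elements of $\Gamma$. Differentiating this 
relation with respect to $\gamma_{ij}$ thus gives 

$$(E_{ji} \otimes I_g)\gamma^\bullet + (\Gamma^\top \otimes I_g) \frac{\partial \gamma^\bullet}{\partial \gamma_{ij}} = 0,$$

where $E_{ji}$ is a $g \times g$ matrix of which the $ji^{\rm th}$ element is 1 and the other 
elements are 0. Consequently, the derivative of $\gamma^\bullet$ with respect to $\gamma_{ij}$ is the 
$g^2$-vector 

$$-\bigl((\Gamma^\top)^{-1} \otimes I_g\bigr)(E_{ji} \otimes I_g)\gamma^\bullet.$$

The derivative of expression (12.107) with respect to $\gamma_{ij}$ is the scalar product 
of this vector with the vector (12.112), that is, the negative of 


$$\begin{aligned} &\gamma^{\bullet\top}(E_{ij} \otimes I_g)(\Gamma^{-1} \otimes I_g)(I_g \otimes B^\top W^\top)(\Omega^{-1} \otimes I_n)(y_\bullet - W_\bullet \pi_\bullet) \\ &\quad = \gamma^{\bullet\top}(E_{ij} \otimes I_g)(\Gamma^{-1} \otimes B^\top W^\top)(\Gamma \otimes I_n)(\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet \beta_\bullet) \\ &\quad = \gamma^{\bullet\top}(E_{ij} \otimes B^\top W^\top)(\Sigma^{-1} \otimes I_n)(y_\bullet - X_\bullet \beta_\bullet). \end{aligned} \tag{12.113}$$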


The second line above makes use of the expression of $\Omega$ in terms of $\Gamma$ and $\Sigma$, 
and of the result (12.108). It is straightforward to see that (12.113) is one 
row of the left-hand side of (12.83), which therefore contains all the first-order 
conditions with respect to the unrestricted elements of $\Gamma$. 


Eigenvalues and Eigenvectors 


Before we can discuss LIML estimation, we need to introduce a few more 
concepts of matrix algebra. A scalar $\lambda$ is said to be an eigenvalue (also called 
a characteristic root or a latent root) of a matrix $A$ if there exists a nonzero 
vector $x$ such that 

$$Ax = \lambda x. \tag{12.114}$$


Thus the action of $A$ on $x$ produces a vector with the same direction as $x$, but 
a different length unless $\lambda = 1$. The vector $x$ is called the eigenvector that 
corresponds to the eigenvalue $\lambda$. Although these concepts are defined quite 
generally, we will restrict our attention to the eigenvalues and eigenvectors of 
real symmetric matrices. 


Equation (12.114) implies that 

$$(A - \lambda I)x = 0, \tag{12.115}$$


from which we conclude that the matrix $A - \lambda I$ is singular. Its determinant, 
$|A - \lambda I|$, is therefore equal to zero. It can be shown that this determinant 
is a polynomial in $\lambda$. The degree of the polynomial is n if $A$ is $n \times n$. The 
fundamental theorem of algebra tells us that such a polynomial has n complex 
roots, say $\lambda_1, \ldots, \lambda_n$. To each $\lambda_i$ there must correspond an eigenvector $x_i$. 
This eigenvector is determined only up to a scale factor, because if $x_i$ is an 
eigenvector corresponding to $\lambda_i$, then so is $\alpha x_i$ for any nonzero scalar $\alpha$. The 
eigenvector $x_i$ does not necessarily have real elements if $\lambda_i$ itself is not real. 
If $A$ is a real symmetric matrix, it can be shown that the eigenvalues $\lambda_i$ are 
all real and that the eigenvectors can be chosen to be real as well. If $A$ is also 
a positive definite matrix, then all its eigenvalues are positive. This follows 
from the facts that $x^\top A x = \lambda x^\top x$ and that both $x^\top x$ and $x^\top A x$ must be 
positive scalars when $A$ is positive definite. 


The eigenvectors of a real symmetric matrix can be chosen to be mutually 
orthogonal. Consider any two eigenvectors $x_i$ and $x_j$ that correspond to two 
distinct eigenvalues $\lambda_i$ and $\lambda_j$. We see that 

$$\lambda_i x_j^\top x_i = x_j^\top A x_i = (A x_j)^\top x_i = \lambda_j x_j^\top x_i.$$

But this is impossible unless $x_j^\top x_i = 0$. Thus we conclude that $x_i$ and $x_j$ 
are necessarily orthogonal. If not all the eigenvalues are distinct, then two (or 
more) eigenvectors may correspond to one and the same eigenvalue. When 
that happens, these two eigenvectors span a space that is orthogonal to all 
other eigenvectors by the reasoning just given. Since any linear combination 
of the two eigenvectors will also be an eigenvector corresponding to the one 
eigenvalue, one may choose an orthogonal set of them. Thus, whether or not 
all the eigenvalues are distinct, eigenvectors may be chosen to be orthonormal, 
by which we mean that they are mutually orthogonal and each has norm equal 
to 1. When the eigenvectors of a real symmetric matrix $A$ are chosen in this 
way, they provide an orthonormal basis for $\mathcal{S}(A)$. 


Let $U \equiv [x_1 \;\cdots\; x_n]$ be a matrix the columns of which are an orthonormal 
set of eigenvectors of $A$, corresponding to the eigenvalues $\lambda_i$, $i = 1, \ldots, n$. 
Then we can write the eigenvalue relationship (12.114) for all the eigenvalues 
at once as 


$$AU = U\Lambda, \tag{12.116}$$


where $\Lambda$ is a diagonal matrix with $\lambda_i$ as its $i^{\rm th}$ diagonal element. The $i^{\rm th}$ 
column of $AU$ is $Ax_i$, and the $i^{\rm th}$ column of $U\Lambda$ is $\lambda_i x_i$. Since the columns of 
$U$ are orthonormal, we find that $U^\top U = I$, which implies that $U^\top = U^{-1}$. A 
matrix with this property is said to be an orthogonal matrix. Postmultiplying 
(12.116) by $U^\top$ gives 

$$A = U\Lambda U^\top. \tag{12.117}$$


Taking determinants of both sides of (12.117), we obtain 

$$|A| = |U|\,|U^\top|\,|\Lambda| = |U|\,|U^{-1}|\,|\Lambda| = |\Lambda| = \prod_{i=1}^{n} \lambda_i,$$


from which we may deduce the important result that the determinant of a 
symmetric matrix is the product of its eigenvalues. In fact, this result holds 
for nonsymmetric matrices as well. 
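
These properties can be checked on a small example. The sketch below assumes NumPy, whose routine `numpy.linalg.eigh` returns the eigenvalues of a real symmetric matrix in ascending order together with an orthonormal set of eigenvectors; the $2 \times 2$ matrix is an arbitrary illustrative choice with eigenvalues 1 and 3.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # real, symmetric, positive definite

lam, U = np.linalg.eigh(A)           # eigenvalues and orthonormal eigenvectors
Lam = np.diag(lam)                   # the diagonal matrix Lambda

orthogonal = np.allclose(U.T @ U, np.eye(2))        # U'U = I
relation = np.allclose(A @ U, U @ Lam)              # AU = U Lambda, (12.116)
decomposition = np.allclose(A, U @ Lam @ U.T)       # A = U Lambda U', (12.117)
det_is_product = np.isclose(np.linalg.det(A), lam.prod())
```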


LIML Estimation 


Consider the system of equations consisting of the structural equation (12.90) 
and the reduced form equations (12.91). The matrix of coefficients of the 
endogenous variables in this system of equations is 


$$\begin{bmatrix} 1 & 0 \\ -\beta_2 & I \end{bmatrix}.$$

Because this matrix is triangular, its determinant is simply the product of 
the elements on the principal diagonal, which is 1. Therefore, there is no 
Jacobian term in the loglikelihood function (12.80) for such a system, and the 
ML estimates may be obtained by minimizing the determinant 


$$\bigl|(Y - XB\Gamma^{-1})^\top (Y - XB\Gamma^{-1})\bigr| = \bigl|(Y\Gamma - XB)^\top (Y\Gamma - XB)\bigr|.$$


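It can, with considerable effort, be shown that minimizing this determinant 
is equivalent to minimizing the ratio 


$$\kappa = \frac{(y - Y\beta_2)^\top M_Z (y - Y\beta_2)}{(y - Y\beta_2)^\top M_W (y - Y\beta_2)} = \frac{\gamma^\top Y_*^\top M_Z Y_* \gamma}{\gamma^\top Y_*^\top M_W Y_* \gamma}, \tag{12.118}$$


where $Y_* = [y \;\; Y]$ and $\gamma = [1 \;\vdots\; -\beta_2]$; see Davidson and MacKinnon (1993, 
Chapter 18). 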


It is possible to minimize $\kappa$ without doing any sort of nonlinear optimization. 
The first-order conditions obtained by differentiating the rightmost expression 
in (12.118) with respect to $\gamma$ are 


$$2Y_*^\top M_Z Y_* \gamma\,(\gamma^\top Y_*^\top M_W Y_* \gamma) - 2Y_*^\top M_W Y_* \gamma\,(\gamma^\top Y_*^\top M_Z Y_* \gamma) = 0.$$

If we divide both sides by $2\gamma^\top Y_*^\top M_W Y_* \gamma$, this becomes 

$$Y_*^\top M_Z Y_* \gamma - \kappa\, Y_*^\top M_W Y_* \gamma = 0. \tag{12.119}$$


An equivalent set of first-order conditions can be obtained by premultiplying 
(12.119) by $(Y_*^\top M_W Y_*)^{-1/2}$ and inserting that factor multiplied by its 
inverse before $\gamma$. After some rearrangement, this yields 


$$\bigl((Y_*^\top M_W Y_*)^{-1/2}\, Y_*^\top M_Z Y_*\, (Y_*^\top M_W Y_*)^{-1/2} - \kappa I\bigr)\gamma^* = 0,$$


where $\gamma^* = (Y_*^\top M_W Y_*)^{1/2} \gamma$. This set of first-order conditions now has 
the form of a standard eigenvalue-eigenvector problem for a real symmetric 
matrix; see equation (12.115). Thus it is clear that $\hat\kappa$ is an eigenvalue of the 
matrix 


$$(Y_*^\top M_W Y_*)^{-1/2}\, Y_*^\top M_Z Y_*\, (Y_*^\top M_W Y_*)^{-1/2}, \tag{12.120}$$


which depends only on observable data, and not on unknown parameters. In 
fact, $\hat\kappa$ must be the smallest eigenvalue, because it is the smallest possible 
value of the ratio (12.118). Given $\hat\kappa$, we can use equations (12.95) to compute 
the LIML estimates. It is worthy of note that, if there is only one endogenous 
variable in the matrix $Y$, then the determinantal equation that determines the 
eigenvalues of (12.120) is just a quadratic equation, of which the smaller root 
is $\hat\kappa$, which can therefore be expressed in this case as a closed-form function 
of the data. 
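
The calculation of $\hat\kappa$ just described can be sketched in a few lines. The code below assumes NumPy; the function names `annihilate` and `liml_kappa` are hypothetical, and the columns of Z are assumed to be included among those of W, as they are in the system considered here. The inverse square root of $Y_*^\top M_W Y_*$ is formed from its own eigendecomposition.

```python
import numpy as np

def annihilate(X, A):
    """Return M_X A, the residuals from regressing the columns of A on X."""
    coef, *_ = np.linalg.lstsq(X, A, rcond=None)
    return A - X @ coef

def liml_kappa(y, Y, Z, W):
    """Smallest eigenvalue of the matrix (12.120).

    y : n-vector, dependent variable of the structural equation
    Y : n x k2 array of included endogenous regressors
    Z : n x k1 array of included exogenous regressors
    W : n x l array of all instruments (assumed to contain Z)
    """
    Ystar = np.column_stack([y, Y])
    A = Ystar.T @ annihilate(Z, Ystar)      # Y*' M_Z Y*
    B = Ystar.T @ annihilate(W, Ystar)      # Y*' M_W Y*
    # B^{-1/2} from the eigendecomposition of the positive definite matrix B
    lam, U = np.linalg.eigh(B)
    B_inv_half = U @ np.diag(lam ** -0.5) @ U.T
    # eigvalsh returns eigenvalues in ascending order; take the smallest
    return float(np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half)[0])
```

Because $M_Z - M_W$ is positive semidefinite when W contains Z, the value returned can never be less than 1, apart from rounding error.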


12.9 Exercises 


12.1 Show that the $gn \times gn$ covariance matrix $\Sigma_\bullet$ defined in equation (12.07) is 
positive definite if and only if the $g \times g$ matrix $\Sigma$ used to define it is positive 
definite. 


12.2 Prove the first result of equations (12.08) for an arbitrary $p \times q$ matrix $A$ 
and an arbitrary $r \times s$ matrix $B$. Prove the second result for $A$ and $B$ as 
above, and for $C$ and $D$ arbitrary $q \times t$ and $s \times u$ matrices, respectively. Prove 
the third result in (12.08) for an arbitrary nonsingular $p \times p$ matrix $A$ and 
nonsingular $r \times r$ matrix $B$. 


12.3 Give details of the interchanges of rows and columns needed to convert $A \otimes B$ 
into $B \otimes A$, where $A$ is $p \times q$ and $B$ is $r \times s$. 


12.4 If $B$ is positive definite, show that $I \otimes B$ is also positive definite, where $I$ is an 
identity matrix of arbitrary dimension. What about $B \otimes I$? If $A$ is another 
positive definite matrix, is it the case that $B \otimes A$ is positive definite? 


12.5 Show explicitly that expression (12.06) provides the OLS estimates of the 
parameters of all the equations of the SUR system. 


12.6 Show explicitly that expression (12.14) for the GLS estimator of the 
parameters of an SUR system follows from the estimating equations (12.13). 


12.7 Show that, for any two vectors $a_1$ and $a_2$ in $E^2$, the quantity $\|a_1\|^2 \|M_1 a_2\|^2$, 
where $M_1$ is the orthogonal projection on to the orthogonal complement 
of $a_1$ in $E^2$, is equal to the square of $a_{11}a_{22} - a_{12}a_{21}$, where $a_{ij}$ denotes the 
$i^{\rm th}$ element of $a_j$, for $i, j = 1, 2$. 


12.8 Using only the properties of determinants listed at the end of the subsection on 
determinants in Section 12.2, show that the determinant of a positive definite 
matrix $B$ is positive. (Hint: write $B = A^\top A$.) Show further that, if $B$ is 
positive semidefinite, without being positive definite, then its determinant 
must be zero. 


12.9 Suppose that m independent random variables, $z_i$, each of which is distributed 
as N(0, 1), are grouped into an m-vector $z$. Let $x = \mu + Az$, where $\mu$ is 
an m-vector and $A$ is a nonsingular $m \times m$ matrix, and let $\Omega = AA^\top$. Show 
that the mean of the vector $x$ is $\mu$ and its covariance matrix is $\Omega$. Then show 
that the density of $x$ is 


$$(2\pi)^{-m/2} |\Omega|^{-1/2} \exp\bigl(-\tfrac{1}{2}(x - \mu)^\top \Omega^{-1} (x - \mu)\bigr). \tag{12.121}$$


This extends the result of Exercise 4.5 for the bivariate normal density to the 
multivariate normal density. Hints: Remember that the joint density of m 
independent random variables is equal to the product of their densities, and 
use the result (12.29). 


12.10 Consider a univariate linear regression model in which the regressors may 
include lags of the dependent variable. Let $y$ and $u$ denote, respectively, the 
vectors of observations on the dependent variable and the error terms, and 
assume that $u \sim N(0, \sigma^2 I_n)$. Show that, even though the Jacobian matrix of 
the transformation (12.31) is not an identity matrix, the determinant of the 
Jacobian is unity. Then write down the loglikelihood function for this model. 
For simplicity, assume that any lagged values of the dependent variable prior 
to the sample period are observed. 


12.11 Consider a multivariate linear regression model of the form (12.28) in which 
the regressors may include lags of the dependent variables and the error terms 
are normally distributed. By ordering the data appropriately, show that the 
determinant of the Jacobian of the transformation (12.31) is equal to unity. 
Then explain why this implies that the loglikelihood function, conditional on 
pre-sample observations, can be written as (12.33). 


12.12 Let $A$ and $B$ be square matrices, of dimensions $p \times p$ and $q \times q$, respectively. 
Use the properties of determinants given in Section 12.2 to show that the 
determinant of $A \otimes B$ is equal to that of $B \otimes A$. 


Use this result, along with any other needed properties of determinants given 
in Section 12.2, to show that the determinant of $\Sigma \otimes I_n$ is $|\Sigma|^n$. 


12.13 Verify that the moment conditions (12.45) and the estimating equations 
(12.46) are equivalent. Show also that expressions (12.47) and (12.48) for 
the covariance matrix estimator for the nonlinear SUR model are equivalent. 
Explain how (12.48) is related to the expression (12.15) that corresponds to 
it in the linear case. 


12.14 The linear expenditure system is a system of demand equations that can be 
written as 


$$s_i = \frac{\gamma_i p_i}{E} + \alpha_i \Bigl(\frac{E - \sum_{j=1}^{m+1} p_j \gamma_j}{E}\Bigr). \tag{12.122}$$

Here, $s_i$, for $i = 1, \ldots, m$, is the share of total expenditure $E$ spent on 
commodity i conditional on $E$ and the prices $p_i$, for $i = 1, \ldots, m + 1$. The equation 
indexed by $i = m + 1$ is omitted as redundant, because the sum of the 
expenditure shares spent on all commodities is necessarily equal to 1. The model 
parameters are the $\alpha_i$, $i = 1, \ldots, m$, the $\gamma_i$, $i = 1, \ldots, m + 1$, and the $m \times m$ 
contemporaneous covariance matrix $\Sigma$. 


Express the system (12.122) as a linear SUR system by use of a suitable 
nonlinear reparametrization. The equations of the resulting system will be 
subject to a set of cross-equation restrictions. Express these restrictions in 
terms of the new parameters, and then set up a GNR in the manner of 
Section 12.3 that allows one to obtain restricted estimates of the $\alpha_i$ and $\gamma_i$. 


12.15 Show that the estimating equations (12.60) are equivalent to the estimating 
equations (12.58). 


Show that the estimating equations (12.65) are equivalent to the equations 
that correspond to the equation-by-equation IV (or 2SLS) estimator for all 
the equations of the system jointly. 


12.16 The $k \times k$ matrix $X_\bullet^\top(\Sigma^{-1} \otimes P_W)X_\bullet$ given in expression (12.66) is positive 
semidefinite by construction. Show this property explicitly by expressing the 
matrix in the form $A^\top A$, where $A$ is a matrix with k columns and at least 
k rows that should depend on a $g \times g$ nonsingular matrix $\Psi$ which satisfies 
the relation $\Psi\Psi^\top = \Sigma^{-1}$. 


Show that a positive semidefinite matrix expressed in the form $A^\top A$ is positive 
definite if and only if $A$ has full column rank. In the present case, the matrix $A$ 
fails to have full column rank if and only if there exists a k-vector $\beta$, different 
from zero, such that $A\beta = 0$. Since $k = \sum_{i=1}^{g} k_i$, we may write the vector 
$\beta$ as $[\beta_1 \;\vdots\; \ldots \;\vdots\; \beta_g]$, where $\beta_i$ is a $k_i$-vector for $i = 1, \ldots, g$. Show that there 
exists a nonzero $\beta$ such that $A\beta = 0$ if and only if, for at least one i, there 
is a nonzero $\beta_i$ such that $P_W X_i \beta_i = 0$, that is, if $P_W X_i$ does not have full 
column rank. 


Show that, if $P_W X_i$ has full column rank, then there exists a unique solution 
of the estimating equations (12.60) for the parameters $\beta_i$ of equation i. 


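12.17 Consider the linear simultaneous equations model 


$$\begin{aligned} y_{t1} &= \beta_{11} + \beta_{21} X_{t2} + \beta_{31} X_{t3} + \gamma_{21} y_{t2} + u_{t1} \\ y_{t2} &= \beta_{12} + \beta_{22} X_{t2} + \beta_{42} X_{t4} + \beta_{52} X_{t5} + \gamma_{12} y_{t1} + u_{t2}. \end{aligned} \tag{12.123}$$


If this model is written in the matrix notation of (12.68), precisely what are 
the matrices $B$ and $\Gamma$ equal to? 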


12.18 Demonstrate that, if each equation in the linear simultaneous equations model 
(12.54) is just identified, in the sense that the order condition for identification 
is satisfied as an equality, then the number of restrictions on the elements of 
the matrices $\Gamma$ and $B$ of the restricted reduced form (12.70) is exactly $g^2$. In 
other words, demonstrate that the restricted and unrestricted reduced forms 
have the same number of parameters in this case. 


12.19 Show that all terms that depend on the matrix $V$ of error terms in the 
finite-sample expression for $n^{-1} X_1^\top P_W X_1$ obtained from equation (12.76) tend to 
zero as $n \to \infty$. 


12.20 Consider the following $p \times q$ partitioned matrix 


$$A = \begin{bmatrix} I_m & A_{12} \\ O & A_{22} \end{bmatrix},$$


where $m < \min(p, q)$. Show that $A$ has full column rank if and only if $A_{22}$ 
has full column rank. Hint: In order to do so, one can show that the existence 
of a nonzero q-vector $x$ such that $Ax = 0$ implies the existence of a nonzero 
$(q - m)$-vector $x_2$ such that $A_{22} x_2 = 0$, and vice versa. 


12.21 Consider equation (12.72), the first structural equation of the linear 
simultaneous system (12.68), with the variables ordered as described in the discussion 
of the asymptotic identification of this equation. Let the matrices $\Gamma$ and $B$ 
of the full system (12.68) be partitioned as follows: 


$$B = \begin{bmatrix} \beta_{11} & B_{12} \\ 0 & B_{22} \end{bmatrix} \quad\text{and}\quad \Gamma = \begin{bmatrix} 1 & \Gamma_{02} \\ -\beta_{21} & \Gamma_{12} \\ 0 & \Gamma_{22} \end{bmatrix},$$


where $\beta_{11}$ is a $k_{11}$-vector, $B_{12}$ and $B_{22}$ are, respectively, $k_{11} \times (g - 1)$ and 
$(l - k_{11}) \times (g - 1)$ matrices, $\beta_{21}$ is a $k_{21}$-vector, and $\Gamma_{02}$, $\Gamma_{12}$, and $\Gamma_{22}$ are, 
respectively, $1 \times (g - 1)$, $k_{21} \times (g - 1)$, and $(g - k_{21} - 1) \times (g - 1)$ matrices. 
Check that the restrictions imposed in this partitioning correspond correctly 
to the structure of (12.72). 

Let $\Gamma^{-1}$ be partitioned as 

$$\Gamma^{-1} = \begin{bmatrix} \gamma^{00} & \Gamma^{01} & \Gamma^{02} \\ \gamma^{10} & \Gamma^{11} & \Gamma^{12} \end{bmatrix},$$


where the rows of $\Gamma^{-1}$ are partitioned in the same pattern as the columns 
of $\Gamma$, and vice versa. Show that $\Gamma_{22}\Gamma^{12}$ is an identity matrix, and that 
$\Gamma_{22}\Gamma^{11}$ is a zero matrix, and specify the dimensions of these matrices. Show 
also that the matrix $[\Gamma^{11} \;\; \Gamma^{12}]$ is square and nonsingular. 


12.22 It was shown in Section 12.4 that the rank condition for the asymptotic 
identification of equation (12.72) is that the $(l - k_{11}) \times k_{21}$ matrix $\Pi_{21}$ of the 
unrestricted reduced form (12.73) should have full column rank. Show that, 
in terms of the structural parameters, $\Pi_{21}$ is equal to $B_{22}\Gamma^{11}$. Then consider 
the matrix 


$$\begin{bmatrix} \Gamma_{22} \\ B_{22} \end{bmatrix} \tag{12.124}$$


and show, by postmultiplying it by the nonsingular matrix $[\Gamma^{11} \;\; \Gamma^{12}]$, that 
it is of full column rank $g - 1$ if and only if $B_{22}\Gamma^{11}$ is of full column rank. 
Conclude that the rank condition for the asymptotic identification of (12.72) 
is that (12.124) should have full column rank. 


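12.23 Consider the expression $(\Gamma^\top \otimes I_n) y_\bullet$, in the notation of Section 12.5. Show 
that it is equal to a gn-vector that can be written as 


$$\begin{bmatrix} Y\gamma_1 \\ \vdots \\ Y\gamma_g \end{bmatrix},$$

where $\gamma_i$, $i = 1, \ldots, g$, is the $i^{\rm th}$ column of $\Gamma$. 


Show similarly that $(\Gamma^\top \otimes I_n)(I_g \otimes WB)\gamma^\bullet$ is equal to a gn-vector that can 
be written as 

$$\begin{bmatrix} Wb_1 \\ \vdots \\ Wb_g \end{bmatrix},$$

where $b_i$ is the $i^{\rm th}$ column of $B$. 


Using these results, demonstrate that $(\Gamma^\top \otimes I_n)\bigl(y_\bullet - (I_g \otimes WB)\gamma^\bullet\bigr)$ is equal 
to $y_\bullet - X_\bullet \beta_\bullet$. Explain why this proves the result (12.108). 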


12.24 By expressing the loglikelihood function (12.107) for the linear simultaneous 
equations model in terms of $\Sigma$ rather than $\Omega$, show that concentrating the 
resulting function with respect to $\Sigma$ yields the concentrated loglikelihood 
function (12.87). 


12.25 Write down the concentrated loglikelihood function for the restricted reduced 
form (12.70) as a special case of (12.51). Then show that this concentrated 
loglikelihood function is identical to expression (12.87). 


12.26 In the model (12.123), what is the identification status of each of the two 
equations? How would your answer change if an additional regressor, $X_{t6}$, 
were added to the first equation only, to the second equation only, or to both 
equations? 


Consider the linear simultaneous system of equations (12.90) and (12.91). 
Write down the estimating equations for the 3SLS estimator for the system, 
and show that they define the same estimator of the parameters of (12.90) as 
the IV estimator applied to that equation alone with instruments W. 


State and prove the analogous result for an SUR system in which only one 
equation is overidentified. 


In the just-identified case of LIML estimation, for which, in the notation 
of (12.91), the number of excluded instruments in the matrix W4 is equal to 
the number of included endogenous variables in the matrix Y, show that the 
minimized value of the ratio « given by (12.92) is equal to the global minimum 
of 1. Show further that the vector of estimates Bo that attains this minimum 
is the IV, or 2SLS, estimator of G2 for equation (12.90) with instruments W. 


In the overidentified case of LIML estimation, explicitly formulate a model 
containing the model consisting of (12.90) and (12.91) as a special case, with 
the overidentifying restrictions relaxed. Show that the maximized loglikeli- 
hood for this unconstrained model is the same function of the data as for the 
constrained model, but with & replaced by 1. 


Consider the demand-supply model 


qt = G11 + Boi X42 + 831 X43 + yorpe + Uti 


(12.125) 
qt = G12 + Bao Xt + B52Xt5 + yoope + U2, 


where qt is the log of quantity, pz is the log of price, Xz2 is the log of income, 
X13 is a dummy variable that accounts for regular demand shifts, and X;4 and 
X:s are the prices of inputs. Thus the first equation of (12.125) is a demand 
function and the second equation is a supply function. 


For this model, precisely what is the vector Ge that was introduced in equation 
(12.55)? What are the matrices B and I that were introduced in equation 
(12.68)? How many overidentifying restrictions are there? 


The file demand-supply.data contains 120 observations generated by the model 
(12.125). Estimate this model by 2SLS, LIML, 3SLS, and FIML. In each case, 
test the overidentifying restrictions, either for each equation individually or 
for the whole system, as appropriate. 


The second equation of (12.125) can be rewritten as

$$p_t = \beta_{12} + \beta_{42}X_{t4} + \beta_{52}X_{t5} + \gamma_{22}q_t + u_{t2}. \tag{12.126}$$

Estimate the system that consists of the first equation of (12.125) and equa-
tion (12.126) by 3SLS and FIML. What is the relationship between the FIML
estimates of this system and the FIML estimates of (12.125)? What is the
relationship between the two sets of 3SLS estimates?


Consider the system

$$y_1 = \beta\iota + \gamma y_2 + u, \qquad y_2 = W\pi_1 + v, \tag{12.127}$$

in which the first equation is the only structural equation and the first column
of $W$ is a vector of 1s. For sample size $n = 25$, and for $l = 2, 4, 6, 8$, generate
$l - 1$ additional instrumental variables as independent drawings from $N(0, 1)$.
Generate the endogenous variables $y_1$ and $y_2$ using the DGP given by (12.127)
with $\beta = 1$ and $\gamma = 1$, $\pi_1$ an $l$-vector with every element equal to 1, and the
$2 \times 2$ contemporaneous covariance matrix $\Sigma$ such that the diagonal elements
are equal to 4, and the off-diagonal elements to 2. Estimate the parameters
$\beta$ and $\gamma$ using both IV (2SLS) and LIML.


Repeat the exercise many times and plot the empirical distributions of the
two estimators of $\gamma$. How do their properties vary with the degree of over-
identification?


Write down both the first-order conditions for minimizing the NLIV criterion 
function (12.103) and the usual estimate of the covariance matrix of oo, 


13.1 Introduction


Time-series data have special features that often require the use of special- 
ized econometric techniques. We have already dealt with some of these. For 
example, we discussed methods for dealing with serial correlation in Sections 
7.6 through 7.9 and in Section 10.7, and we discussed heteroskedasticity and 
autocorrelation consistent (HAC) covariance matrices in Section 9.3. In this 
chapter and the next, we discuss a variety of techniques that are commonly 
used to model, and test hypotheses about, economic time series. 


A first point concerns notation. In the time series literature, it is usual to refer 
to a variable, series, or process by its typical element. For instance, one may 
speak of a variable $y_t$ or a set of variables $Y_t$, rather than defining a vector $y$
or a matrix Y. We will make free use of this convention in our discussion of 
time series. 


The methods we will discuss fall naturally into two groups. Some of them are 
intended for use with stationary time series, and others are intended for use 
with nonstationary time series. We defined stationarity in Section 7.6. Recall 
that a random process for a time series $y_t$ is said to be covariance stationary
if the unconditional expectation and variance of $y_t$, and the unconditional
covariance between $y_t$ and $y_{t-j}$, for any lag $j$, are the same for all $t$. In this
chapter, we restrict our attention to time series that are covariance station- 
ary. Nonstationary time series and techniques for dealing with them will be 
discussed in Chapter 14. 


Section 13.2 discusses stochastic processes that can be used to model the 
way in which the conditional mean of a single time series evolves over time. 
These are based on the autoregressive and moving average processes that 
were introduced in Section 7.6. Section 13.3 discusses methods for estimating 
this sort of univariate time-series model. Section 13.4 then discusses single- 
equation dynamic regression models, which provide richer ways to model the 
relationships among time-series variables than do static regression models. 
Section 13.5 deals with seasonality and seasonal adjustment. Section 13.6 
discusses autoregressive conditional heteroskedasticity, which provides a way 
to model the evolution of the conditional variance of a time series. Finally,
Section 13.7 deals with vector autoregressions, which are a particularly simple 
and commonly used way to model multivariate time series. 


13.2 Autoregressive and Moving Average Processes 


In Section 7.6, we introduced the concept of a stochastic process and briefly 
discussed autoregressive and moving average processes. Our purpose there 
was to provide methods for modeling serial dependence in the error terms of a 
regression model. But these processes can also be used directly to model the 
dynamic evolution of an economic time series. When they are used for this 
purpose, it is common to add a constant term, because most economic time 
series do not have mean zero. 


Autoregressive Processes 


In Section 7.6, we discussed the $p^{\text{th}}$ order autoregressive, or AR($p$), process. If
we add a constant term, such a process can be written, with slightly different
notation, as

$$y_t = \gamma + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \cdots + \rho_p y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \sigma_\varepsilon^2). \tag{13.01}$$

According to this specification, the $\varepsilon_t$ are homoskedastic and uncorrelated
innovations. Such a process is often referred to as white noise, by a peculiar
mixed metaphor, of long standing, which cheerfully mixes a visual and an
auditory image. Throughout this chapter, the notation $\varepsilon_t$ refers to a white
noise process with variance $\sigma_\varepsilon^2$.


Note that the constant term $\gamma$ in equation (13.01) is not the unconditional
mean of $y_t$. We assume throughout this chapter that the processes we con-
sider are covariance stationary, in the sense that was given to that term in
Section 7.6. This implies that $\mu \equiv E(y_t)$ does not depend on $t$. Thus, by
equating the expectations of both sides of (13.01), we find that

$$\mu = \gamma + \mu\sum_{i=1}^{p}\rho_i.$$

Solving this equation for $\mu$ yields the result that

$$\mu = \frac{\gamma}{1 - \sum_{i=1}^{p}\rho_i}. \tag{13.02}$$

If we define $u_t = y_t - \mu$, it is then easy to see that

$$u_t = \sum_{i=1}^{p}\rho_i u_{t-i} + \varepsilon_t, \tag{13.03}$$


which is exactly the definition (7.33) of an AR(p) process given in Section 7.6. 
In the lag operator notation we introduced in that section, equation (13.03) 
can also be written as

$$u_t = \rho(L)u_t + \varepsilon_t, \quad\text{or as}\quad \bigl(1 - \rho(L)\bigr)u_t = \varepsilon_t,$$

where the polynomial $\rho$ is defined by equation (7.35), that is, $\rho(z) = \rho_1 z +
\rho_2 z^2 + \cdots + \rho_p z^p$. Similarly, the expression for the unconditional mean $\mu$ in
equation (13.02) can be written as $\gamma/(1 - \rho(1))$.
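
As a small numerical illustration of (13.02), the following sketch computes $\mu = \gamma/(1 - \rho(1))$ and checks it by iterating the expectation of (13.01) from an arbitrary starting value. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
def ar_unconditional_mean(gamma, rhos):
    """Return mu = gamma / (1 - rho(1)), where rho(1) is the sum of the AR coefficients."""
    rho_at_one = sum(rhos)
    if rho_at_one == 1.0:
        raise ValueError("rho(1) = 1: the unconditional mean is undefined")
    return gamma / (1.0 - rho_at_one)


def iterate_mean(gamma, rhos, start=0.0, steps=500):
    """Iterate E(y_t) = gamma + sum_i rho_i E(y_{t-i}) from an arbitrary start."""
    history = [start] * len(rhos)            # most recent expectation first
    for _ in range(steps):
        new = gamma + sum(r * m for r, m in zip(rhos, history))
        history = [new] + history[:-1]
    return history[0]


# Hypothetical AR(2) with gamma = 2, rho_1 = 0.5, rho_2 = 0.3.
print(ar_unconditional_mean(2.0, [0.5, 0.3]))
print(iterate_mean(2.0, [0.5, 0.3]))
```

For a stationary process the two numbers should agree closely, which is just the statement that the mean does not depend on $t$.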


The covariance matrix of the vector $u$ of which the typical element is $u_t$ was
given in equation (7.32) for the case of an AR(1) process. The elements of this 
matrix are called the autocovariances of the AR(1) process. We introduced 
this term in Section 9.3 in the context of HAC covariance matrices, and its 
meaning here is similar. For an AR(p) process, the autocovariances and the 
corresponding autocorrelations can be computed by using a set of equations 
called the Yule-Walker equations. We discuss these equations in detail for an 
AR(2) process; the generalization to the AR(p) case is straightforward but 
algebraically more complicated. 


An AR(2) process without a constant term is defined by the equation

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t. \tag{13.04}$$

Let $v_0$ denote the unconditional variance of $u_t$, and let $v_i$ denote the covariance
of $u_t$ and $u_{t-i}$, for $i = 1, 2, \ldots$. Because the process is stationary, the $v_i$, which
are by definition the autocovariances of the AR(2) process, do not depend on $t$.
Multiplying equation (13.04) by $u_t$ and taking expectations of both sides, we
find that

$$v_0 = \rho_1 v_1 + \rho_2 v_2 + \sigma_\varepsilon^2. \tag{13.05}$$

Because $u_{t-1}$ and $u_{t-2}$ are uncorrelated with the innovation $\varepsilon_t$, the last term
on the right-hand side here is $E(u_t\varepsilon_t) = E(\varepsilon_t^2) = \sigma_\varepsilon^2$. Similarly, multiplying
equation (13.04) by $u_{t-1}$ and $u_{t-2}$ and taking expectations, we find that

$$v_1 = \rho_1 v_0 + \rho_2 v_1 \quad\text{and}\quad v_2 = \rho_1 v_1 + \rho_2 v_0. \tag{13.06}$$

Equations (13.05) and (13.06) can be rewritten as a set of three simultaneous
linear equations for $v_0$, $v_1$, and $v_2$:

$$\begin{aligned}
v_0 - \rho_1 v_1 - \rho_2 v_2 &= \sigma_\varepsilon^2 \\
\rho_1 v_0 + (\rho_2 - 1)v_1 &= 0 \\
\rho_2 v_0 + \rho_1 v_1 - v_2 &= 0.
\end{aligned} \tag{13.07}$$

These equations are the first three Yule-Walker equations for the AR(2) pro-
cess. As readers are asked to show in Exercise 13.1, their solution is

$$v_0 = \frac{\sigma_\varepsilon^2}{D}(1 - \rho_2), \quad v_1 = \frac{\sigma_\varepsilon^2}{D}\rho_1, \quad v_2 = \frac{\sigma_\varepsilon^2}{D}\bigl(\rho_1^2 + \rho_2(1 - \rho_2)\bigr), \tag{13.08}$$

where $D \equiv (1 + \rho_2)(1 + \rho_1 - \rho_2)(1 - \rho_1 - \rho_2)$.


Figure 13.1 The stationarity triangle for an AR(2) process [plotted in the $(\rho_1, \rho_2)$ plane, with vertices at $(-2, -1)$, $(2, -1)$, and $(0, 1)$]


The result (13.08) makes it clear that $\rho_1$ and $\rho_2$ are not the autocorrelations of
an AR(2) process. Recall that, for an AR(1) process, the same $\rho$ that appears
in the defining equation $u_t = \rho u_{t-1} + \varepsilon_t$ is also the correlation of $u_t$ and $u_{t-1}$.
This simple result does not generalize to higher-order processes. Similarly,
the autocovariances and autocorrelations of $u_t$ and $u_{t-i}$ for $i > 2$ have a
more complicated form for AR processes of order greater than 1. They can,
however, be determined readily enough by using the Yule-Walker equations.
Thus, if we multiply both sides of equation (13.04) by $u_{t-i}$ for any $i \geq 2$, and
take expectations, we obtain the equation

$$v_i = \rho_1 v_{i-1} + \rho_2 v_{i-2}.$$

Since $v_0$, $v_1$, and $v_2$ are given by equations (13.08), this equation allows us to
solve recursively for any $v_i$ with $i > 2$.
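
The calculation just described can be sketched numerically. The code below solves the three Yule-Walker equations (13.07) as a linear system, compares the result with the closed form (13.08), and then applies the recursion for higher lags. The helper `solve` is a small Gaussian-elimination routine written for this illustration, and the parameter values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ar2_autocovariances(rho1, rho2, sigma2=1.0, max_lag=5):
    """Autocovariances v_0, ..., v_max_lag of a stationary AR(2) process."""
    # The first three Yule-Walker equations, written as in (13.07).
    a = [[1.0, -rho1, -rho2],
         [rho1, rho2 - 1.0, 0.0],
         [rho2, rho1, -1.0]]
    v = solve(a, [sigma2, 0.0, 0.0])
    # Higher lags from the recursion v_i = rho1 v_{i-1} + rho2 v_{i-2}.
    for i in range(3, max_lag + 1):
        v.append(rho1 * v[i - 1] + rho2 * v[i - 2])
    return v[:max_lag + 1]


def ar2_closed_form(rho1, rho2, sigma2=1.0):
    """The closed-form solution (13.08) for v_0, v_1, and v_2."""
    d = (1 + rho2) * (1 + rho1 - rho2) * (1 - rho1 - rho2)
    return [sigma2 * (1 - rho2) / d,
            sigma2 * rho1 / d,
            sigma2 * (rho1 ** 2 + rho2 * (1 - rho2)) / d]


print([round(x, 4) for x in ar2_autocovariances(0.5, 0.3)])
print([round(x, 4) for x in ar2_closed_form(0.5, 0.3)])
```

Dividing each $v_i$ by $v_0$ gives the autocorrelations; in particular the first one is $\rho_1/(1 - \rho_2)$, not $\rho_1$.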


Necessary conditions for the stationarity of the AR(2) process follow directly
from equations (13.08). The $3 \times 3$ covariance matrix

$$\begin{bmatrix} v_0 & v_1 & v_2 \\ v_1 & v_0 & v_1 \\ v_2 & v_1 & v_0 \end{bmatrix} \tag{13.09}$$

of any three consecutive elements of an AR(2) process must be a positive
definite matrix. Otherwise, the solution (13.08) to the first three Yule-Walker
equations, based on the hypothesis of stationarity, would make no sense. The
denominator $D$ evidently must not vanish if this solution is to be finite. In
Exercise 13.3, readers are asked to show that the lines along which it vanishes
in the plane of $\rho_1$ and $\rho_2$ define the edges of a stationarity triangle such that
the matrix (13.09) is positive definite only in the interior of this triangle. The
stationarity triangle is shown in Figure 13.1.
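
A minimal sketch of the triangle condition follows. It tests whether a point $(\rho_1, \rho_2)$ makes all three factors of $D$ positive, and compares this with the usual root condition for an AR(2) process, namely that both roots of $1 - \rho_1 z - \rho_2 z^2 = 0$ lie outside the unit circle. The function names are hypothetical, and the agreement of the two checks is illustrated here on a few points rather than proved.

```python
import cmath


def inside_triangle(rho1, rho2):
    """True if (rho1, rho2) lies strictly inside the stationarity triangle."""
    return (1 + rho2 > 0) and (1 + rho1 - rho2 > 0) and (1 - rho1 - rho2 > 0)


def roots_outside_unit_circle(rho1, rho2):
    """True if both roots of 1 - rho1 z - rho2 z^2 = 0 lie outside the unit circle."""
    if rho2 == 0.0:
        # The equation is linear: the single root is 1 / rho1 (no root if rho1 = 0).
        return rho1 == 0.0 or abs(1.0 / rho1) > 1.0
    disc = cmath.sqrt(rho1 ** 2 + 4.0 * rho2)
    roots = [(-rho1 + disc) / (2.0 * rho2), (-rho1 - disc) / (2.0 * rho2)]
    return all(abs(z) > 1.0 for z in roots)


for point in [(0.5, 0.3), (1.5, -0.8), (0.5, 0.6), (0.0, -1.2)]:
    print(point, inside_triangle(*point), roots_outside_unit_circle(*point))
```

The first two points lie inside the triangle and the last two outside it, and the two checks give the same verdict at each of them.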


Moving Average Processes


A $q^{\text{th}}$ order moving average, or MA($q$), process with a constant term can be
written as

$$y_t = \mu + \alpha_0\varepsilon_t + \alpha_1\varepsilon_{t-1} + \cdots + \alpha_q\varepsilon_{t-q}, \tag{13.10}$$

where the $\varepsilon_t$ are white noise, and the coefficient $\alpha_0$ is generally normalized
to 1 for purposes of identification. The expectation of the $y_t$ is readily seen
to be $\mu$, and so we can write

$$u_t \equiv y_t - \mu = \varepsilon_t + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j} = \bigl(1 + \alpha(L)\bigr)\varepsilon_t,$$

where the polynomial $\alpha$ is defined by $\alpha(z) = \sum_{j=1}^{q}\alpha_j z^j$.


The autocovariances of an MA process are much easier to calculate than those
of an AR process. Since the $\varepsilon_t$ are white noise, and hence uncorrelated, the
variance of the $u_t$ is seen to be

$$\mathrm{Var}(u_t) = E(u_t^2) = \sigma_\varepsilon^2\Bigl(1 + \sum_{j=1}^{q}\alpha_j^2\Bigr). \tag{13.11}$$

Similarly, the $j^{\text{th}}$ order autocovariance is, for $j > 0$,

$$E(u_t u_{t-j}) = \begin{cases}
\sigma_\varepsilon^2\bigl(\alpha_j + \sum_{i=1}^{q-j}\alpha_{j+i}\alpha_i\bigr) & \text{for } j < q, \\
\sigma_\varepsilon^2\alpha_j & \text{for } j = q, \text{ and} \\
0 & \text{for } j > q.
\end{cases} \tag{13.12}$$

Using (13.12) and (13.11), we can calculate the autocorrelation $\rho(j)$ between
$y_t$ and $y_{t-j}$ for $j > 0$.¹ We find that

$$\rho(j) = \frac{\alpha_j + \sum_{i=1}^{q-j}\alpha_{j+i}\alpha_i}{1 + \sum_{i=1}^{q}\alpha_i^2} \;\text{ for } j \leq q, \qquad \rho(j) = 0 \text{ otherwise}, \tag{13.13}$$

where it is understood that, for $j = q$, the numerator is just $\alpha_j$. The fact that
all of the autocorrelations are equal to 0 for $j > q$ is sometimes convenient,
but it suggests that $q$ may often have to be large if an MA($q$) model is to be
satisfactory. Expression (13.13) also implies that $q$ must be large if an MA($q$)
model is to display any autocorrelation coefficients that are big in absolute
value. Recall from Section 7.6 that, for an MA(1) model, the largest possible
absolute value of $\rho(1)$ is only 0.5.
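
Formulas (13.11) to (13.13) translate directly into a few lines of code. The sketch below treats $\alpha_0 = 1$ as the leading coefficient, so that each autocovariance is a sum of products of coefficients that are $j$ positions apart; the function names and the coefficient values are hypothetical.

```python
def ma_autocovariances(alphas, sigma2=1.0):
    """Autocovariances E(u_t u_{t-j}), j = 0, ..., q, of an MA(q) process."""
    coeffs = [1.0] + list(alphas)            # alpha_0 is normalized to 1
    q = len(alphas)
    return [sigma2 * sum(coeffs[i] * coeffs[i + j] for i in range(q - j + 1))
            for j in range(q + 1)]


def ma_autocorrelations(alphas):
    """Autocorrelations rho(1), ..., rho(q), as in (13.13)."""
    v = ma_autocovariances(alphas)
    return [vj / v[0] for vj in v[1:]]


print(ma_autocorrelations([1.0]))           # MA(1) with alpha_1 = 1
print(ma_autocorrelations([0.5, 0.25]))     # an MA(2) example

# Largest |rho(1)| over a grid of MA(1) coefficients.
grid = [k / 100.0 for k in range(-300, 301)]
print(max(abs(ma_autocorrelations([a])[0]) for a in grid))
```

The grid search is only a numerical illustration of the bound of 0.5 on $|\rho(1)|$ for an MA(1) process, which is attained at $\alpha_1 = \pm 1$.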


¹ The notation $\rho$ is unfortunately in common use both for the parameters of an
AR process and for the autocorrelations of an AR or MA process. We therefore
distinguish between the parameter $\rho_i$ and the autocorrelation $\rho(j)$.


If we want to allow for nonzero autocorrelations at all lags, we have to allow
$q$ to be infinite. This means replacing (13.10) by the infinite-order moving
average process

$$u_t = \varepsilon_t + \sum_{i=1}^{\infty}\alpha_i\varepsilon_{t-i} = \bigl(1 + \alpha(L)\bigr)\varepsilon_t, \tag{13.14}$$

where $\alpha(L)$ is no longer a polynomial, but rather a (formal) infinite power
series in $L$. Of course, this MA($\infty$) process is impossible to estimate in
practice. Nevertheless, it is of theoretical interest, provided that

$$\mathrm{Var}(u_t) = \sigma_\varepsilon^2\Bigl(1 + \sum_{i=1}^{\infty}\alpha_i^2\Bigr)$$

is a finite quantity. A necessary and sufficient condition for this to be the case
is that the coefficients $\alpha_i$ are square summable, which means that

$$\lim_{q\to\infty}\sum_{i=1}^{q}\alpha_i^2 < \infty. \tag{13.15}$$

We will implicitly assume that all the MA($\infty$) processes we encounter satisfy
condition (13.15).


Any stationary AR($p$) process can be represented as an MA($\infty$) process. We
will not attempt to prove this fundamental result in general, but we can easily
show how it works in the case of a stationary AR(1) process. Such a process
can be written as

$$(1 - \rho_1 L)u_t = \varepsilon_t.$$

The natural way to solve this equation for $u_t$ as a function of $\varepsilon_t$ is to multiply
both sides by the inverse of $1 - \rho_1 L$. The result is

$$u_t = (1 - \rho_1 L)^{-1}\varepsilon_t. \tag{13.16}$$

Formally, this is the solution we are seeking. But we need to explain what it
means to invert $1 - \rho_1 L$.


In general, if A(L) and B(L) are power series in L, each including a constant 
term independent of L that is not necessarily equal to 1, then B(L) is the 
inverse of A(L) if B(L)A(L) = 1. Here the product B(L)A(L) is the infinite 
power series in L obtained by formally multiplying together the power series 
B(L) and A(L); see Exercise 13.5. The relation B(L)A(L) = 1 then requires 
that the result of this multiplication should be a series with only one term, 
the first. Moreover, this term, which corresponds to $L^0$, must equal 1.


We will not consider general methods for inverting a polynomial in the lag
operator; see Hamilton (1994) or Hayashi (2000), among many others. In this
particular case, though, the solution turns out to be

$$(1 - \rho_1 L)^{-1} = 1 + \rho_1 L + \rho_1^2 L^2 + \rho_1^3 L^3 + \cdots. \tag{13.17}$$

To see this, note that $\rho_1 L$ times the right-hand side of equation (13.17) is the
same series without the first term of 1. Thus, as required,

$$(1 - \rho_1 L)^{-1} - \rho_1 L(1 - \rho_1 L)^{-1} = (1 - \rho_1 L)(1 - \rho_1 L)^{-1} = 1.$$

We can now use this result to solve equation (13.16). We find that

$$u_t = \varepsilon_t + \rho_1\varepsilon_{t-1} + \rho_1^2\varepsilon_{t-2} + \cdots. \tag{13.18}$$

It is clear that (13.18) is a special case of the MA($\infty$) process (13.14), with
$\alpha_i = \rho_1^i$ for $i = 0, \ldots, \infty$. Square summability of the $\alpha_i$ is easy to check
provided that $|\rho_1| < 1$.


In general, if we can write a stationary AR($p$) process as

$$\bigl(1 - \rho(L)\bigr)u_t = \varepsilon_t, \tag{13.19}$$

where $\rho(L)$ is a polynomial of degree $p$ in the lag operator, then there exists
an MA($\infty$) process

$$u_t = \bigl(1 + \alpha(L)\bigr)\varepsilon_t, \tag{13.20}$$

where $\alpha(L)$ is an infinite series in $L$ such that $(1 - \rho(L))(1 + \alpha(L)) = 1$. This
result provides an alternative way to the Yule-Walker equations to calculate
the variance, autocovariances, and autocorrelations of an AR($p$) process by
using equations (13.11), (13.12), and (13.13), after we have solved for $\alpha(L)$.
However, the general methods for solving for $\alpha(L)$ make use of the theory of
functions of a complex variable, and so they are not elementary.
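
Although a general analytical solution for $\alpha(L)$ is not elementary, its leading coefficients can be obtained numerically by matching powers of $L$ in $(1 - \rho(L))(1 + \alpha(L)) = 1$. The sketch below does this for a finite number of terms; it is a numerical shortcut under the assumption of stationarity, not the analytical method alluded to above, and the function name and parameter values are hypothetical.

```python
def ma_infinity_coefficients(rhos, n_terms):
    """Coefficients of 1, L, L^2, ... in the inverse of 1 - rho(L), truncated at n_terms."""
    psi = [1.0]
    for k in range(1, n_terms):
        # The coefficient of L^k in (1 - rho(L)) * psi(L) must vanish for k >= 1.
        psi.append(sum(rhos[i - 1] * psi[k - i]
                       for i in range(1, min(k, len(rhos)) + 1)))
    return psi


# AR(1): the coefficients are the powers of rho_1, as in (13.17).
print(ma_infinity_coefficients([0.5], 5))

# AR(2): the variance implied by (13.11) should approach v_0 from (13.08).
rho1, rho2 = 0.5, 0.3
psi = ma_infinity_coefficients([rho1, rho2], 400)
d = (1 + rho2) * (1 + rho1 - rho2) * (1 - rho1 - rho2)
print(sum(c * c for c in psi), (1 - rho2) / d)
```

With $\sigma_\varepsilon^2 = 1$, the truncated sum of squared coefficients and the Yule-Walker value of $v_0$ should agree to many decimal places, because the coefficients die out geometrically for a stationary process.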


The close relationship between AR and MA processes goes both ways. If
(13.20) is an MA($q$) process that is invertible, then there exists a stationary
AR($\infty$) process of the form (13.19) with

$$\bigl(1 - \rho(L)\bigr)\bigl(1 + \alpha(L)\bigr) = 1.$$

The condition for a moving average process to be invertible is formally the
same as the condition for an autoregressive process to be stationary; see the
discussion around equation (7.36). We require that all the roots of the poly-
nomial equation $1 + \alpha(z) = 0$ must lie outside the unit circle. For an MA(1)
process, the invertibility condition is simply that $|\alpha_1| < 1$.


ARMA Processes 


If our objective is to model the evolution of a time series as parsimoniously as 
possible, it may well be desirable to employ a stochastic process that has both 
autoregressive and moving average components. This is the autoregressive 
moving average process, or ARMA process. In general, we can write an 
ARMA(p, q) process with nonzero mean as 


$$\bigl(1 - \rho(L)\bigr)y_t = \gamma + \bigl(1 + \alpha(L)\bigr)\varepsilon_t, \tag{13.21}$$

and a process with zero mean as

$$\bigl(1 - \rho(L)\bigr)u_t = \bigl(1 + \alpha(L)\bigr)\varepsilon_t, \tag{13.22}$$

where $\rho(L)$ and $\alpha(L)$ are, respectively, a $p^{\text{th}}$ order and a $q^{\text{th}}$ order polynomial
in the lag operator, neither of which includes a constant term. If the process is
stationary, the expectation of $y_t$ given by (13.21) is $\mu \equiv \gamma/(1 - \rho(1))$, just as
for the AR($p$) process (13.01). Provided the autoregressive part is stationary
and the moving average part is invertible, an ARMA($p$, $q$) process can always
be represented as either an MA($\infty$) or an AR($\infty$) process.


The most commonly encountered ARMA process is the ARMA(1,1) process,
which, when there is no constant term, has the form

$$u_t = \rho_1 u_{t-1} + \varepsilon_t + \alpha_1\varepsilon_{t-1}. \tag{13.23}$$

This process has one autoregressive and one moving average parameter.


The Yule-Walker method can be extended to compute the autocovariances
of an ARMA process. We illustrate this for the ARMA(1,1) case and invite
readers to generalize the procedure in Exercise 13.6. As before, we denote
the $i^{\text{th}}$ autocovariance by $v_i$, and we let $E(u_t\varepsilon_{t-i}) = w_i$, for $i = 0, 1, \ldots$.
Note that $E(u_t\varepsilon_s) = 0$ for all $s > t$. If we multiply (13.23) by $\varepsilon_t$ and take
expectations, we see that $w_0 = \sigma_\varepsilon^2$. If we then multiply (13.23) by $\varepsilon_{t-1}$ and
repeat the process, we find that $w_1 = \rho_1 w_0 + \alpha_1\sigma_\varepsilon^2$, from which we conclude
that $w_1 = \sigma_\varepsilon^2(\rho_1 + \alpha_1)$. Although we do not need them at present, we note
that the $w_i$ for $i > 1$ can be found by multiplying (13.23) by $\varepsilon_{t-i}$, which gives
the recursion $w_i = \rho_1 w_{i-1}$, with solution $w_i = \sigma_\varepsilon^2\rho_1^{i-1}(\rho_1 + \alpha_1)$.


Next, we imitate the way in which the Yule-Walker equations are set up for
an AR process. Multiplying equation (13.23) first by $u_t$ and then by $u_{t-1}$,
and subsequently taking expectations, gives

$$\begin{aligned}
v_0 &= \rho_1 v_1 + w_0 + \alpha_1 w_1 = \rho_1 v_1 + \sigma_\varepsilon^2(1 + \alpha_1\rho_1 + \alpha_1^2), \text{ and} \\
v_1 &= \rho_1 v_0 + \alpha_1 w_0 = \rho_1 v_0 + \alpha_1\sigma_\varepsilon^2,
\end{aligned}$$

where we have used the expressions for $w_0$ and $w_1$ given in the previous
paragraph. When these two equations are solved for $v_0$ and $v_1$, they yield

$$v_0 = \sigma_\varepsilon^2\,\frac{1 + 2\rho_1\alpha_1 + \alpha_1^2}{1 - \rho_1^2}, \quad\text{and}\quad v_1 = \sigma_\varepsilon^2\,\frac{\rho_1 + \rho_1^2\alpha_1 + \rho_1\alpha_1^2 + \alpha_1}{1 - \rho_1^2}. \tag{13.24}$$

Finally, multiplying equation (13.23) by $u_{t-i}$ for $i > 1$ and taking expectations
gives $v_i = \rho_1 v_{i-1}$, from which we conclude that

$$v_i = \sigma_\varepsilon^2\,\frac{\rho_1^{i-1}(\rho_1 + \rho_1^2\alpha_1 + \rho_1\alpha_1^2 + \alpha_1)}{1 - \rho_1^2}. \tag{13.25}$$

Equation (13.25) provides all the autocovariances of an ARMA(1, 1) process.
Using it and the first of equations (13.24), we can derive the autocorrelations.
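
The formulas (13.24) and (13.25) can be checked numerically against the MA($\infty$) representation, whose coefficients are the $w_i/\sigma_\varepsilon^2$ found above. The sketch below does so for hypothetical parameter values; the function names are illustrative, and the MA($\infty$) sum is simply truncated after a large number of terms.

```python
def arma11_autocovariances(rho1, alpha1, sigma2=1.0, max_lag=5):
    """Autocovariances v_0, ..., v_max_lag of a stationary ARMA(1,1) process."""
    denom = 1.0 - rho1 ** 2
    v = [sigma2 * (1 + 2 * rho1 * alpha1 + alpha1 ** 2) / denom,
         sigma2 * (rho1 + rho1 ** 2 * alpha1 + rho1 * alpha1 ** 2 + alpha1) / denom]
    for i in range(2, max_lag + 1):
        v.append(rho1 * v[i - 1])            # v_i = rho_1 v_{i-1} for i > 1
    return v[:max_lag + 1]


def arma11_autocovariance_via_ma(rho1, alpha1, lag, sigma2=1.0, n_terms=2000):
    """Cross-check using the truncated MA(infinity) form with weights w_i / sigma2."""
    psi = [1.0] + [rho1 ** (i - 1) * (rho1 + alpha1) for i in range(1, n_terms)]
    return sigma2 * sum(psi[k] * psi[k + lag] for k in range(n_terms - lag))


v = arma11_autocovariances(0.6, 0.4)
print([round(x, 6) for x in v[:4]])
print([round(arma11_autocovariance_via_ma(0.6, 0.4, j), 6) for j in range(4)])
print([round(x / v[0], 6) for x in v[1:4]])   # the autocorrelations
```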


Autocorrelation Functions 


As we have seen, the autocorrelation between $u_t$ and $u_{t-j}$ can be calculated
theoretically for any known stationary ARMA process. The autocorrelation
function, or ACF, expresses the autocorrelation as a function of the lag $j$ for
$j = 1, 2, \ldots$. If we have a sample $y_t$, $t = 1, \ldots, n$, from an ARMA process
of possibly unknown order, then the $j^{\text{th}}$ order autocorrelation $\rho(j)$ can be
estimated by using the formula

$$\hat\rho(j) = \frac{\widehat{\mathrm{Cov}}(y_t, y_{t-j})}{\widehat{\mathrm{Var}}(y_t)}, \tag{13.26}$$

where

$$\widehat{\mathrm{Cov}}(y_t, y_{t-j}) = \frac{1}{n-1}\sum_{t=j+1}^{n}(y_t - \bar y)(y_{t-j} - \bar y), \tag{13.27}$$

and

$$\widehat{\mathrm{Var}}(y_t) = \frac{1}{n-1}\sum_{t=1}^{n}(y_t - \bar y)^2. \tag{13.28}$$

In equations (13.27) and (13.28), $\bar y$ is the mean of the $y_t$. Of course, (13.28)
is just the special case of (13.27) in which $j = 0$. It may seem odd to divide
by $n - 1$ rather than by $n - j - 1$ in (13.27). However, if we did not use the
same denominator for every $j$, the estimated autocorrelation matrix would
not necessarily be positive definite. Because the denominator is the same, the
factors of $1/(n - 1)$ cancel in the formula (13.26).


The empirical ACF, or sample ACF, expresses the $\hat\rho(j)$, defined in equation
(13.26), as a function of the lag $j$. Graphing the sample ACF provides a
convenient way to see what the pattern of serial dependence in any observed
time series looks like, and it may help to suggest what sort of stochastic
process would provide a good way to model the data. For example, if the
data were generated by an MA(1) process, we would expect that $\hat\rho(1)$ would
be an estimate of $\alpha_1/(1 + \alpha_1^2)$, the value of $\rho(1)$ implied by (13.13), and all
the other $\hat\rho(j)$ would be approximately equal to
zero. If the data were generated by an AR(1) process with $\rho_1 > 0$, we would
expect that $\hat\rho(1)$ would be an estimate of $\rho_1$ and would be relatively large, the
next few $\hat\rho(j)$ would be progressively smaller, and the ones for large $j$ would
be approximately equal to zero. A graph of the sample ACF is sometimes
called a correlogram; see Exercise 13.15.
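
A minimal sketch of the sample ACF defined by (13.26) to (13.28) follows. The function name is hypothetical, and the short series is made up purely so that the arithmetic can be followed by hand.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag), as in (13.26)-(13.28)."""
    n = len(y)
    ybar = sum(y) / n
    dev = [yt - ybar for yt in y]

    def cov(j):
        # The same divisor n - 1 is used for every lag j, as in (13.27).
        return sum(dev[t] * dev[t - j] for t in range(j, n)) / (n - 1)

    var = cov(0)
    return [cov(j) / var for j in range(1, max_lag + 1)]


print(sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 2))
```

For this toy series the deviations from the mean are $-2, -1, 0, 1, 2$, so the first two sample autocorrelations work out to $4/10$ and $-1/10$. Plotting the output of `sample_acf` against the lag gives the correlogram.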

The partial autocorrelation function, or PACF, is another way to characterize
the relationship between $y_t$ and its lagged values. The partial autocorrelation
coefficient of order $j$ is defined as the true value of the coefficient $\rho_j^{(j)}$ in the
linear regression

$$y_t = \gamma^{(j)} + \rho_1^{(j)}y_{t-1} + \cdots + \rho_j^{(j)}y_{t-j} + \varepsilon_t, \tag{13.29}$$

or, equivalently, in the minimization problem

$$\min_{\gamma^{(j)},\,\rho_i^{(j)}} E\Bigl(y_t - \gamma^{(j)} - \sum_{i=1}^{j}\rho_i^{(j)}y_{t-i}\Bigr)^{2}. \tag{13.30}$$

The superscript “(j)” appears on all the coefficients in regression (13.29) to
make it plain that all the coefficients, not just the last one, are functions of $j$,
the number of lags. We can calculate the empirical PACF, or sample PACF,
up to order $J$ by running regression (13.29) for $j = 1, \ldots, J$ and retaining
only the estimate $\hat\rho_j^{(j)}$ for each $j$. Just as a graph of the sample ACF may
help to suggest what sort of stochastic process would provide a good way to
model the data, so a graph of the sample PACF, interpreted properly, may
do the same. For example, if the data were generated by an AR(2) process,
we would expect the first two partial autocorrelations to be relatively large,
and all the remaining ones to be insignificantly different from zero.
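
The procedure for the sample PACF can be sketched as follows: for each $j$, regression (13.29) is run by OLS and only the last coefficient is kept. The helper `solve` is a small Gaussian-elimination routine written for this illustration, and the simulated AR(2) series, its parameter values, and the random seed are all hypothetical.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def sample_pacf(y, max_lag):
    """Sample PACF: the last OLS coefficient of regression (13.29) for each j."""
    pacf = []
    for j in range(1, max_lag + 1):
        # Regressors: a constant and y_{t-1}, ..., y_{t-j}; the first j observations are lost.
        rows = [[1.0] + [y[t - i] for i in range(1, j + 1)] for t in range(j, len(y))]
        target = y[j:]
        k = j + 1
        xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
        xty = [sum(r[a] * yt for r, yt in zip(rows, target)) for a in range(k)]
        pacf.append(solve(xtx, xty)[-1])
    return pacf


# Illustrative data: an AR(2) process with hypothetical parameter values.
rng = random.Random(42)
y = [0.0, 0.0]
for _ in range(5000):
    y.append(1.0 + 0.5 * y[-1] + 0.3 * y[-2] + rng.gauss(0.0, 1.0))
y = y[500:]                                  # discard a burn-in period
print([round(p, 3) for p in sample_pacf(y, 4)])
```

With data of this kind, one would expect the first two entries to be clearly nonzero and the later ones to be close to zero, which is the pattern described in the text for an AR(2) process.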


13.3 Estimating AR, MA, and ARMA Models 


All of the time-series models that we have discussed so far are special cases
of an ARMA($p$, $q$) model with a constant term, which can be written as

$$y_t = \gamma + \sum_{i=1}^{p}\rho_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}, \tag{13.31}$$

where the $\varepsilon_t$ are assumed to be white noise. There are $p + q + 1$ parameters to
estimate in the model (13.31): the $\rho_i$, for $i = 1, \ldots, p$, the $\alpha_j$, for $j = 1, \ldots, q$,
and $\gamma$. Recall that $\gamma$ is not the unconditional expectation of $y_t$ unless all of
the $\rho_i$ are zero.


For our present purposes, it is perfectly convenient to work with models that
allow $y_t$ to depend on exogenous explanatory variables and are therefore even
more general than (13.31). Such models are sometimes referred to as ARMAX
models. The ‘X’ indicates that $y_t$ depends on a row vector $X_t$ of exogenous
variables as well as on its own lagged values. An ARMAX($p$, $q$) model takes
the form

$$y_t = X_t\beta + u_t, \qquad u_t \sim \mathrm{ARMA}(p, q), \qquad E(u_t) = 0, \tag{13.32}$$

where $X_t\beta$ is the mean of $y_t$ conditional on $X_t$ but not conditional on lagged
values of $y_t$. The ARMA model (13.31) can evidently be recast in the form
of the ARMAX model (13.32); see Exercise 13.13.


Estimation of AR Models 


We have already studied a variety of ways of estimating the model (13.32) 
when $u_t$ follows an AR(1) process. In Chapter 7, we discussed three estimation
methods. The first was estimation by a nonlinear regression, in which the
first observation is dropped from the sample. The second was estimation by 
feasible GLS, possibly iterated, in which the first observation can be taken 
into account. The third was estimation by the GNR that corresponds to 
the nonlinear regression with an extra artificial observation corresponding to 
the first observation. It turned out that estimation by iterated feasible GLS 
and by this extended artificial regression, both taking the first observation 
into account, yield the same estimates. Then, in Chapter 10, we discussed 
estimation by maximum likelihood, and, in Exercise 10.21, we showed how to 
extend the GNR by yet another artificial observation in such a way that it 
provides the ML estimates if convergence is achieved. 


Similar estimation methods exist for models in which the error terms follow
an AR($p$) process with $p > 1$. The easiest method is just to drop the first $p$
observations and estimate the nonlinear regression model

$$y_t = X_t\beta + \sum_{i=1}^{p}\rho_i(y_{t-i} - X_{t-i}\beta) + \varepsilon_t$$

by nonlinear least squares. If this is a pure time-series model for which
$X_t\beta = \beta$, then this is equivalent to OLS estimation of the model

$$y_t = \gamma + \sum_{i=1}^{p}\rho_i y_{t-i} + \varepsilon_t,$$

where the relationship between $\gamma$ and $\beta$ is derived in Exercise 13.13. This
approach is the simplest and most widely used for pure autoregressive models.
It has the advantage that, although the $\rho_i$ (but not their estimates) must
satisfy the necessary condition for stationarity, the error terms $u_t$ need not
be stationary. This issue was mentioned in Section 7.8, in the context of the
AR(1) model, where it was seen that the variance of the first error term $u_1$
must satisfy a certain condition for $u_t$ to be stationary.
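
The equivalence between the nonlinear regression and the OLS regression in the pure time-series case can be sketched by direct substitution of $X_t\beta = \beta$. This is only an outline of the step that Exercise 13.13 asks readers to work through, written in the notation used above:

```latex
\begin{align*}
y_t &= \beta + \sum_{i=1}^{p} \rho_i (y_{t-i} - \beta) + \varepsilon_t \\
    &= \beta \Bigl(1 - \sum_{i=1}^{p} \rho_i\Bigr) + \sum_{i=1}^{p} \rho_i y_{t-i} + \varepsilon_t ,
\end{align*}
% so the constant of the OLS regression is
% \gamma = \beta (1 - \sum_{i=1}^{p} \rho_i),
% which matches (13.02) when \beta is the unconditional mean \mu.
```

The constant term of the OLS regression is therefore $\beta$ multiplied by $1 - \rho(1)$, and an estimate of the mean can be recovered from the OLS estimates as $\hat\gamma/(1 - \sum_i \hat\rho_i)$.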


Maximum Likelihood Estimation 


If we are prepared to assume that $u_t$ is indeed stationary, it is desirable not
to lose the information in the first $p$ observations. The most convenient way
to achieve this goal is to use maximum likelihood under the assumption that
the white noise process $\varepsilon_t$ is normal. In addition to using more information,
maximum likelihood has the advantage that the estimates of the $\rho_i$ are auto-
matically constrained to satisfy the stationarity conditions.


For any ARMA($p$, $q$) process in the error terms $u_t$, the assumption that the $\varepsilon_t$
are normally distributed implies that the $u_t$ are normally distributed, and so
also the dependent variable $y_t$, conditional on the explanatory variables. For
an observed sample of size $n$ from the ARMAX model (13.32), let $y$ denote
the $n$-vector of which the elements are $y_1, \ldots, y_n$. The expectation of $y$
conditional on the explanatory variables is $X\beta$, where $X$ is the $n \times k$ matrix
with typical row $X_t$. Let $\Omega$ denote the autocovariance matrix of the vector $y$.
This matrix can be written as

$$\Omega = \begin{bmatrix}
v_0 & v_1 & v_2 & \cdots & v_{n-1} \\
v_1 & v_0 & v_1 & \cdots & v_{n-2} \\
v_2 & v_1 & v_0 & \cdots & v_{n-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_{n-1} & v_{n-2} & v_{n-3} & \cdots & v_0
\end{bmatrix}, \tag{13.33}$$

where, as before, $v_i$ is the stationary covariance of $u_t$ and $u_{t-i}$, and $v_0$ is
the stationary variance of the $u_t$. Then, using expression (12.121) for the
multivariate normal density, we see that the log of the joint density of the
observed sample is

$$-\frac{n}{2}\log 2\pi - \frac{1}{2}\log|\Omega| - \frac{1}{2}(y - X\beta)^{\top}\Omega^{-1}(y - X\beta). \tag{13.34}$$


In order to construct the loglikelihood function for the ARMAX model (13.32),
the $v_i$ must be expressed as functions of the parameters $\rho_i$ and $\alpha_j$ of the
ARMA($p$, $q$) process that generates the error terms. Doing this allows us to
replace $\Omega$ in the log density (13.34) by a matrix function of these parameters.
Unfortunately, a loglikelihood function in the form of (13.34) is difficult to
work with, because of the presence of the $n \times n$ matrix $\Omega$. Most of the
difficulty disappears if we can find an upper-triangular matrix $\Psi$ such that
$\Psi\Psi^{\top} = \Omega^{-1}$, as was necessary when, in Section 7.8, we wished to estimate by
feasible GLS a model like (13.32) with AR(1) errors. It then becomes possible
to decompose expression (13.34) into a sum of contributions that are easier
to work with than (13.34) itself.


If the errors are generated by an AR(p) process, with no MA component, then 
such a matrix $\Psi$ is relatively easy to find, as we will illustrate in a moment
for the AR(2) case. However, if an MA component is present, matters are 
more difficult. Even for MA(1) errors, the algebra is quite complicated — see 
Hamilton (1994, Chapter 5) for a convincing demonstration of this fact. For 
general ARMA (p, q) processes, the algebra is quite intractable. In such cases, 
a technique called the Kalman filter can be used to evaluate the successive con- 
tributions to the loglikelihood for given parameter values, and can thus serve 
as the basis of an algorithm for maximizing the loglikelihood. This technique, 
to which Hamilton (1994, Chapter 13) provides an accessible introduction, is 
unfortunately beyond the scope of this book. 


We now turn our attention to the case in which the errors follow an AR(2) 
process. In Section 7.8, we constructed a matrix $\Psi$ corresponding to the sta-
tionary covariance matrix of an AR(1) process by finding $n$ linear combina-
tions of the error terms $u_t$ that were homoskedastic and serially uncorrelated.
We perform a similar exercise for AR(2) errors here. This will show how to 
set about the necessary algebra for more general AR(p) processes. 


Errors generated by an AR(2) process satisfy equation (13.04). Therefore, for
$t \geq 3$, we can solve for $\varepsilon_t$ to obtain

$$\varepsilon_t = u_t - \rho_1 u_{t-1} - \rho_2 u_{t-2}, \qquad t = 3, \ldots, n. \tag{13.35}$$

Under the normality assumption, the fact that the $\varepsilon_t$ are white noise means
that they are mutually independent. Thus observations 3 through $n$ make
contributions to the loglikelihood of the form

$$\ell_t(y^t, \beta, \rho_1, \rho_2, \sigma_\varepsilon) = -\frac{1}{2}\log 2\pi - \log\sigma_\varepsilon - \frac{1}{2\sigma_\varepsilon^2}\bigl(u_t(\beta) - \rho_1 u_{t-1}(\beta) - \rho_2 u_{t-2}(\beta)\bigr)^2, \tag{13.36}$$

where $y^t$ is the vector that consists of $y_1$ through $y_t$, $u_t(\beta) \equiv y_t - X_t\beta$, and
$\sigma_\varepsilon^2$ is as usual the variance of the $\varepsilon_t$. The contribution (13.36) is analogous to
the contribution (10.85) for the AR(1) case.

The variance of the first error term, $u_1$, is just the stationary variance $v_0$ given
by (13.08). We can therefore define $\varepsilon_1$ as $\sigma_\varepsilon u_1/\sqrt{v_0}$, that is,

$$\varepsilon_1 = \Bigl(\frac{D}{1 - \rho_2}\Bigr)^{1/2}u_1, \tag{13.37}$$

where $D$ was defined just after equations (13.08). By construction, $\varepsilon_1$ has the
same variance $\sigma_\varepsilon^2$ as the $\varepsilon_t$ for $t \geq 3$. Since the $\varepsilon_t$ are innovations, it follows
that, for $t > 1$, $\varepsilon_t$ is independent of $u_1$, and hence of $\varepsilon_1$. For the loglikelihood
contribution from observation 1, we therefore take the log density of $\varepsilon_1$, plus
a Jacobian term which is the log of the derivative of $\varepsilon_1$ with respect to $u_1$.
The result is readily seen to be

$$\ell_1(y_1, \beta, \rho_1, \rho_2, \sigma_\varepsilon) = -\frac{1}{2}\log 2\pi - \log\sigma_\varepsilon + \frac{1}{2}\log\frac{D}{1 - \rho_2} - \frac{D}{2\sigma_\varepsilon^2(1 - \rho_2)}\,u_1^2(\beta). \tag{13.38}$$


Finding a suitable expression for $\varepsilon_2$ is a little trickier. What we seek is a linear
combination of $u_1$ and $u_2$ that has variance $\sigma_\varepsilon^2$ and is independent of $u_1$. By
construction, any such linear combination is independent of the $\varepsilon_t$ for $t > 2$.
A little algebra shows that the appropriate linear combination is

$$\sigma_\varepsilon\Bigl(\frac{v_0}{v_0^2 - v_1^2}\Bigr)^{1/2}\Bigl(u_2 - \frac{v_1}{v_0}\,u_1\Bigr).$$

Use of the explicit expressions for $v_0$ and $v_1$ given in equations (13.08) then
shows that

$$\varepsilon_2 = (1-\rho_2^2)^{1/2}\Bigl(u_2 - \frac{\rho_1}{1-\rho_2}\,u_1\Bigr), \tag{13.39}$$




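as readers are invited to check in Exercise 13.9. The derivative of $\varepsilon_2$ with
respect to $u_2$ is $(1-\rho_2^2)^{1/2}$, and so the contribution to the loglikelihood from
observation 2 can be written as

$$\ell_2(y^2, \beta, \rho_1, \rho_2, \sigma_\varepsilon) = -\tfrac12\log 2\pi - \log\sigma_\varepsilon + \tfrac12\log(1-\rho_2^2) - \frac{1-\rho_2^2}{2\sigma_\varepsilon^2}\Bigl(u_2(\beta) - \frac{\rho_1}{1-\rho_2}\,u_1(\beta)\Bigr)^2. \tag{13.40}$$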


Summing the contributions (13.36), (13.38), and (13.40) gives the loglikelihood
function for the entire sample. It may then be maximized with respect
to $\beta$, $\rho_1$, $\rho_2$, and $\sigma_\varepsilon^2$ by standard numerical methods.
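
The three kinds of contribution can be summed in a few lines of code. The following Python sketch is only an illustration under stated assumptions: the function names `ar2_loglik` and `ar2_covariance` are ours, the residuals $u_t(\beta)$ are assumed to have been computed already, and $D$ is taken to be $(1+\rho_2)(1+\rho_1-\rho_2)(1-\rho_1-\rho_2)$, which is the standard expression implied by the stationary variance of an AR(2) process. The second function builds the stationary covariance matrix by brute force so that the result can be checked against the multivariate normal log density.

```python
import numpy as np

def ar2_loglik(u, rho1, rho2, sigma_eps):
    """Exact Gaussian loglikelihood of residuals u (length >= 3) under stationary AR(2) errors."""
    u = np.asarray(u, dtype=float)
    n = u.size
    # Assumed form of D: v0 = sigma_eps^2 * (1 - rho2) / D
    D = (1 + rho2) * (1 + rho1 - rho2) * (1 - rho1 - rho2)
    s2 = sigma_eps ** 2
    const = -0.5 * np.log(2 * np.pi) - np.log(sigma_eps)
    # Observation 1: log density of eps_1 plus the Jacobian term
    ll = const + 0.5 * np.log(D / (1 - rho2)) - D * u[0] ** 2 / (2 * s2 * (1 - rho2))
    # Observation 2: eps_2 is a scaled version of u_2 - rho1/(1 - rho2) * u_1
    e2 = u[1] - rho1 / (1 - rho2) * u[0]
    ll += const + 0.5 * np.log(1 - rho2 ** 2) - (1 - rho2 ** 2) * e2 ** 2 / (2 * s2)
    # Observations 3..n: ordinary innovations
    eps = u[2:] - rho1 * u[1:-1] - rho2 * u[:-2]
    ll += (n - 2) * const - np.sum(eps ** 2) / (2 * s2)
    return float(ll)

def ar2_covariance(n, rho1, rho2, sigma_eps):
    """Stationary covariance matrix of (u_1, ..., u_n), n >= 2, used here only as a check."""
    D = (1 + rho2) * (1 + rho1 - rho2) * (1 - rho1 - rho2)
    v = np.empty(n)
    v[0] = sigma_eps ** 2 * (1 - rho2) / D
    v[1] = rho1 * v[0] / (1 - rho2)
    for k in range(2, n):
        v[k] = rho1 * v[k - 1] + rho2 * v[k - 2]   # Yule-Walker recursion
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return v[idx]
```

In practice one would pass the negative of `ar2_loglik`, evaluated at $u(\beta) = y - X\beta$, to a numerical optimizer of one's choice.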


Exercise 13.10 asks readers to check that the $n \times n$ matrix $W$ defined implicitly
by the relation $W^\top u = \varepsilon$, where the elements of $\varepsilon$ are defined by (13.35),
(13.37), and (13.39), is indeed upper triangular and such that $WW^\top$ is equal
to $1/\sigma_\varepsilon^2$ times the inverse of the covariance matrix (13.33) for the $u_t$ that
correspond to an AR(2) process.


Estimation of MA and ARMA Models 


Just why moving average and ARMA models are more difficult to estimate 
than pure autoregressive models is apparent if we consider the MA(1) model 


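$$y_t = \mu + \varepsilon_t - \alpha_1\varepsilon_{t-1}, \tag{13.41}$$

where for simplicity the only explanatory variable is a constant, and we have
changed the sign of $\alpha_1$. For the first three observations, if we substitute
recursively for $\varepsilon_{t-1}$, equation (13.41) can be written as

$$\begin{aligned}
y_1 &= \mu - \alpha_1\varepsilon_0 + \varepsilon_1,\\
y_2 &= (1+\alpha_1)\mu - \alpha_1 y_1 - \alpha_1^2\varepsilon_0 + \varepsilon_2,\\
y_3 &= (1+\alpha_1+\alpha_1^2)\mu - \alpha_1 y_2 - \alpha_1^2 y_1 - \alpha_1^3\varepsilon_0 + \varepsilon_3.
\end{aligned}$$

It is not difficult to see that, for arbitrary $t$, this becomes

$$y_t = \Bigl(\sum_{s=0}^{t-1}\alpha_1^s\Bigr)\mu - \sum_{s=1}^{t-1}\alpha_1^s y_{t-s} - \alpha_1^t\varepsilon_0 + \varepsilon_t. \tag{13.42}$$

Were it not for the presence of the unobserved $\varepsilon_0$, equation (13.42) would be
a nonlinear regression model, albeit a rather complicated one in which the
form of the regression function depends explicitly on $t$.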


This fact can be used to develop tractable methods for estimating a model 
where the errors have an MA component without going to the trouble of set- 
ting up the complicated loglikelihood. The estimates are not equal to ML es- 
timates, and are in general less efficient, although in some cases they are 
asymptotically equivalent. The simplest approach, which is sometimes rather
misleadingly called conditional least squares, is just to assume that any unobserved
pre-sample innovations, such as $\varepsilon_0$, are equal to 0, an assumption that
is harmless asymptotically. A more sophisticated approach is to “backcast” 
the pre-sample innovations from initial estimates of the other parameters and 
then run the nonlinear regression (13.42) conditional on the backcasts, that is, 
the backward forecasts. Yet another approach is to treat the unobserved in- 
novations as parameters to be estimated jointly by maximum likelihood with 
the parameters of the MA process and those of the regression function. 
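
The first of these approaches is easy to sketch in code. The Python fragment below is a minimal illustration, not a description of what any particular package does: the names `ma1_residuals` and `ma1_cls` are ours, the pre-sample innovation is set to zero, and, purely to keep the example short, $\mu$ is fixed at the sample mean and $\alpha_1$ is found by a crude grid search rather than by a proper nonlinear least squares routine.

```python
import numpy as np

def ma1_residuals(y, mu, alpha1, eps0=0.0):
    """Recover innovations recursively from y_t = mu + eps_t - alpha1 * eps_{t-1}."""
    eps = np.empty(len(y))
    prev = eps0  # pre-sample innovation; 0 corresponds to "conditional least squares"
    for t, yt in enumerate(y):
        prev = yt - mu + alpha1 * prev
        eps[t] = prev
    return eps

def ma1_cls(y, grid=None):
    """Grid search for alpha1 minimizing the sum of squared recursive residuals.
    Simplifying assumption: mu is fixed at the sample mean instead of being estimated jointly."""
    grid = np.linspace(-0.95, 0.95, 191) if grid is None else np.asarray(grid)
    mu = float(np.mean(y))
    ssr = [np.sum(ma1_residuals(y, mu, a) ** 2) for a in grid]
    return mu, float(grid[int(np.argmin(ssr))])
```

Backcasting would replace the argument `eps0=0.0` by a backward forecast of $\varepsilon_0$ computed from preliminary parameter estimates.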


Alternative statistical packages use a number of different methods for esti- 
mating models with ARMA errors, and they may therefore yield different 
estimates; see Newbold, Agiakloglou, and Miller (1994) for a more detailed 
account. Moreover, even if they provide the same estimates, different pack- 
ages may well provide different standard errors. In the case of ML estimation, 
for example, these may be based on the empirical Hessian estimator (10.42), 
the OPG estimator (10.44), or the sandwich estimator (10.45), among others. 
If the innovations are heteroskedastic, only the sandwich estimator is valid. 


A more detailed discussion of standard methods for estimating AR, MA, and 
ARMA models is beyond the scope of this book. Detailed treatments may 
be found in Box, Jenkins, and Reinsel (1994, Chapter 7), Hamilton (1994, 
Chapter 5), and Fuller (1995, Chapter 8), among others. 


Indirect Inference 


There is another approach to estimating ARMA models, which is unlikely to 
be used by statistical packages but is worthy of attention if the available sam- 
ple is not too small. It is an application of the method of indirect inference, 
which was developed by Smith (1993) and Gouriéroux, Monfort, and Renault 
(1993). The idea is that, when a model is difficult to estimate, there may be 
an auxiliary model that is not too different from the model of interest but 
is much easier to estimate. For any two such models, there must exist so- 
called binding functions that relate the parameters of the model of interest to 
those of the auxiliary model. The idea of indirect inference is to estimate the 
parameters of interest from the parameter estimates of the auxiliary model 
by using the relationships given by the binding functions. 


Because pure AR models are easy to estimate and can be used as auxiliary 
models, it is natural to use this approach with models that have an MA 
component. For simplicity, suppose the model of interest is the pure time- 
series MA(1) model (13.41), and the auxiliary model is the AR(1) model 


$$y_t = \gamma + \rho y_{t-1} + u_t, \tag{13.43}$$

which we estimate by OLS to obtain estimates $\hat\gamma$ and $\hat\rho$. Let us define the
elementary zero function $u_t(\gamma, \rho)$ as $y_t - \gamma - \rho y_{t-1}$. Then the estimating
equations satisfied by $\hat\gamma$ and $\hat\rho$ are

$$\sum_{t=2}^{n} u_t(\hat\gamma, \hat\rho) = 0 \quad\text{and}\quad \sum_{t=2}^{n} y_{t-1}\,u_t(\hat\gamma, \hat\rho) = 0. \tag{13.44}$$


If $y_t$ is indeed generated by (13.41) for particular values of $\mu$ and $\alpha_1$, then we
may define the pseudo-true values of the parameters $\gamma$ and $\rho$ of the auxiliary
model (13.43) as those values for which the expectations of the left-hand sides
of equations (13.44) are zero. These equations can thus be interpreted as
correctly specified, albeit inefficient, estimating equations for the pseudo-true
values. The theory of Section 9.5 then shows that $\hat\gamma$ and $\hat\rho$ are consistent for
the pseudo-true values and asymptotically normal, with asymptotic covariance
matrix given by a version of the sandwich matrix (9.67).


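The pseudo-true values can be calculated as follows. Replacing $y_t$ and $y_{t-1}$
in the definition of $u_t(\gamma, \rho)$ by the expressions given by (13.41), we see that

$$u_t(\gamma,\rho) = (1-\rho)\mu - \gamma + \varepsilon_t - (\alpha_1+\rho)\varepsilon_{t-1} + \alpha_1\rho\,\varepsilon_{t-2}. \tag{13.45}$$

The expectation of the right-hand side of this equation is just $(1-\rho)\mu - \gamma$.
Similarly, the expectation of $y_{t-1}u_t(\gamma, \rho)$ can be seen to be

$$\mu\bigl((1-\rho)\mu - \gamma\bigr) - \sigma_\varepsilon^2(\alpha_1+\rho) - \sigma_\varepsilon^2\alpha_1^2\rho.$$

Equating these expectations to zero shows us that the pseudo-true values are

$$\gamma = \frac{\mu(1+\alpha_1+\alpha_1^2)}{1+\alpha_1^2} \quad\text{and}\quad \rho = \frac{-\alpha_1}{1+\alpha_1^2} \tag{13.46}$$

in terms of the true parameters $\mu$ and $\alpha_1$.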


Equations (13.46) express the binding functions that link the parameters of
model (13.41) to those of the auxiliary model (13.43). The indirect estimates
$\hat\mu$ and $\hat\alpha_1$ are obtained by solving these equations with $\gamma$ and $\rho$ replaced by $\hat\gamma$
and $\hat\rho$. Note that, since the second equation of (13.46) is a quadratic equation
for $\alpha_1$ in terms of $\rho$, there are in general two solutions for $\alpha_1$, which may be
complex. See Exercise 13.11 for further elucidation of this point.
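
A small Python sketch may make the inversion of the binding functions concrete. The function names are ours. The second equation of (13.46) is rewritten as $\rho\alpha_1^2 + \alpha_1 + \rho = 0$, whose two roots have product 1; choosing the root with $|\alpha_1| \le 1$ is our assumption here (it is the invertible solution), not something the binding functions themselves dictate. When $|\hat\rho| > 1/2$ the roots are complex and the sketch simply returns `None`.

```python
import math

def binding(mu, alpha1):
    """Pseudo-true (gamma, rho) of the AR(1) auxiliary model for an MA(1) with parameters (mu, alpha1)."""
    rho = -alpha1 / (1 + alpha1 ** 2)
    gamma = mu * (1 - rho)   # equals mu * (1 + alpha1 + alpha1^2) / (1 + alpha1^2)
    return gamma, rho

def invert_binding(gamma_hat, rho_hat):
    """Solve the binding functions for (mu, alpha1), keeping the root with |alpha1| <= 1.
    Returns None when the quadratic in alpha1 has complex roots."""
    disc = 1 - 4 * rho_hat ** 2
    if disc < 0:
        return None
    if rho_hat == 0:
        alpha1 = 0.0
    else:
        alpha1 = (-1 + math.sqrt(disc)) / (2 * rho_hat)
    mu = gamma_hat / (1 - rho_hat)
    return mu, alpha1
```

For example, $\mu = 2$ and $\alpha_1 = 0.5$ give pseudo-true values $\gamma = 2.8$ and $\rho = -0.4$, and inverting these recovers the original parameters.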


In order to estimate the covariance matrix of $\hat\mu$ and $\hat\alpha_1$, we must first estimate
the covariance matrix of $\hat\gamma$ and $\hat\rho$. Let us define the $n \times 2$ matrix $Z$ as $[\iota \;\; y_{-1}]$,
that is, a matrix of which the first column is a vector of 1s and the second the
vector of the $y_t$ lagged. Then, since the Jacobian of the zero functions $u_t(\gamma, \rho)$
is just $-Z$, it is easy to see that the covariance matrix (9.67) becomes

$$\operatorname*{plim}_{n\to\infty}\;(n^{-1}Z^\top Z)^{-1}\,n^{-1}Z^\top\Omega Z\,(n^{-1}Z^\top Z)^{-1}, \tag{13.47}$$

where $\Omega$ is the covariance matrix of the error terms $u_t$, which are given by
the $u_t(\gamma, \rho)$ evaluated at the pseudo-true values. If we drop the probability
limit and the factors of $n^{-1}$ in expression (13.47) and replace $\Omega$ by a suitable
estimate, we obtain an estimate of the covariance matrix of $\hat\gamma$ and $\hat\rho$. Instead
of estimating $\Omega$ directly, it is convenient to employ a HAC estimator of the
middle factor of expression (13.47).² Since, as can be seen from equation
(13.45), the $u_t$ have nonzero autocovariances only up to order 2, it is natural
in this case to use the Hansen-White estimator (9.37) with lag truncation
parameter set equal to 2. Finally, an estimate of the covariance matrix of
$\hat\mu$ and $\hat\alpha_1$ can be obtained from the one for $\hat\gamma$ and $\hat\rho$ by the delta method
(Section 5.6) using the relation (13.46) between the true and pseudo-true
parameters.


In this example, indirect inference is particularly simple because the auxiliary 
model (13.43) has just as many parameters as the model of interest (13.41). 
However, this will rarely be the case. We saw in Section 13.2 that a finite-order 
MA or ARMA process can always be represented by an AR($\infty$) process. This
suggests that, when estimating an MA or ARMA model, we should use as an 
auxiliary model an AR(p) model with p substantially greater than the number 
of parameters in the model of interest. See Zinde-Walsh and Galbraith (1994, 
1997) for implementations of this approach. 


Clearly, indirect inference is impossible if the auxiliary model has fewer para- 
meters than the model of interest. If, as is commonly the case, it has more, 
then the parameters of the model of interest are overidentified. This means 
that we cannot just solve for them from the estimates of the auxiliary model. 
Instead, we need to minimize a suitable criterion function, so as to make the 
estimates of the auxiliary model as close as possible, in the appropriate sense, 
to the values implied by the parameter estimates of the model of interest. In 
the next paragraph, we explain how to do this in a very general setting. 


Let the estimates of the pseudo-true parameters be an $l$-vector $\hat\beta$, let the
parameters of the model of interest be a $k$-vector $\theta$, and let the binding
functions be an $l$-vector $b(\theta)$, with $l > k$. Then the indirect estimator of $\theta$ is
obtained by minimizing the quadratic form

$$\bigl(\hat\beta - b(\theta)\bigr)^\top \hat\Sigma^{-1}\bigl(\hat\beta - b(\theta)\bigr) \tag{13.48}$$

with respect to $\theta$, where $\hat\Sigma$ is a consistent estimate of the $l \times l$ covariance
matrix of $\hat\beta$. Minimizing this quadratic form minimizes the length of the
vector $\hat\beta - b(\theta)$ after that vector has been transformed so that its covariance
matrix is approximately the identity matrix.


Expression (13.48) looks very much like a criterion function for efficient GMM
estimation. Not surprisingly, it can be shown that, under suitable regularity
conditions, the minimized value of this criterion function is asymptotically
distributed as $\chi^2(l-k)$. This provides a simple way to test the overidentifying
restrictions that must hold if the model of interest actually generated the data.
As with efficient GMM estimation, tests of restrictions on the vector $\theta$ can
be based on the difference between the restricted and unrestricted values of
expression (13.48).


2 In this special case, an expression for $\Omega$ as a function of $\alpha_1$, $\rho$, and $\sigma_\varepsilon^2$ can be
obtained from equation (13.45), so that we can estimate $\Omega$ as a function of
consistent estimates of those parameters. In most cases, however, it will be
necessary to use a HAC estimator.


In many applications, including general ARMA processes, it can be difficult or
impossible to find tractable analytic expressions for the binding functions. In
that case, they may be estimated by simulation. This works well if it is easy
to draw simulated samples from DGPs in the model of interest, and also easy
to estimate the auxiliary model. Simulations are then carried out as follows.
In order to evaluate the criterion function (13.48) at a parameter vector $\theta$, we
draw $S$ independent simulated data sets from the DGP characterized by $\theta$,
and for each of them we compute the estimate $\hat\beta_s^*(\theta)$ of the parameters of the
auxiliary model. The binding functions are then estimated by

$$b^*(\theta) = \frac{1}{S}\sum_{s=1}^{S}\hat\beta_s^*(\theta).$$

We then use $b^*(\theta)$ in place of $b(\theta)$ when we evaluate the criterion function
(13.48). As with the method of simulated moments (Section 9.6), the same
random numbers should be used to compute $\hat\beta_s^*(\theta)$ for each given $s$ and for all $\theta$.
Much more detailed discussions of indirect inference can be found in Smith
(1993) and Gouriéroux, Monfort, and Renault (1993).
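
The simulation scheme just described can be sketched in Python for the MA(1) model with an AR(1) auxiliary model. Everything named here (`simulate_ma1`, `ar1_ols`, `simulated_binding`, `criterion`, the seed, and the default $S = 10$) is an illustrative assumption, and normal innovations with unit variance are assumed. The point of the sketch is the seeding: simulated sample $s$ always uses the same stream of random numbers, whatever the value of $\theta$, so that the criterion function varies smoothly with $\theta$.

```python
import numpy as np

def simulate_ma1(mu, alpha1, n, rng):
    """Draw n observations from y_t = mu + eps_t - alpha1 * eps_{t-1}."""
    eps = rng.standard_normal(n + 1)        # includes the pre-sample innovation eps_0
    return mu + eps[1:] - alpha1 * eps[:-1]

def ar1_ols(y):
    """OLS estimates (gamma_hat, rho_hat) of the auxiliary AR(1) model."""
    Z = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    return coef

def simulated_binding(theta, n, S=10, seed=12345):
    """Average of the auxiliary estimates over S simulated samples with fixed random numbers."""
    mu, alpha1 = theta
    draws = []
    for s in range(S):
        rng = np.random.default_rng([seed, s])   # same stream for sample s at every theta
        draws.append(ar1_ols(simulate_ma1(mu, alpha1, n, rng)))
    return np.mean(draws, axis=0)

def criterion(theta, beta_hat, Sigma_hat, n, S=10):
    """Quadratic form in beta_hat - b*(theta), weighted by the inverse of Sigma_hat."""
    d = np.asarray(beta_hat, dtype=float) - simulated_binding(theta, n, S)
    return float(d @ np.linalg.solve(Sigma_hat, d))
```

The function `criterion` would then be handed to a numerical minimizer; with $l = k = 2$, as here, the minimized value should be essentially zero.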


Simulating ARMA Models 


Simulating data from an MA($q$) process is trivially easy. For a sample of
size $n$, one generates white-noise innovations $\varepsilon_t^*$ for $t = -q+1, \ldots, 0, \ldots, n$,
most commonly, but not necessarily, from the normal distribution. Then, for
$t = 1, \ldots, n$, the simulated data are given by

$$u_t^* = \varepsilon_t^* + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^*.$$


There is no need to worry about missing pre-sample innovations in the context 
of simulation, because they are simulated along with the other innovations. 
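
As a minimal Python sketch of this recipe (the function names and the choice of normal innovations are ours), the first function applies the formula to a given vector of innovations whose first $q$ elements are the pre-sample values, and the second draws the innovations:

```python
import numpy as np

def ma_from_innovations(alpha, eps):
    """u_t = eps_t + sum_j alpha_j * eps_{t-j}; eps holds the q pre-sample innovations first."""
    alpha = np.asarray(alpha, dtype=float)
    eps = np.asarray(eps, dtype=float)
    q = alpha.size
    n = eps.size - q
    u = eps[q:].copy()                              # eps_1, ..., eps_n
    for j in range(1, q + 1):
        u += alpha[j - 1] * eps[q - j : q - j + n]  # eps_{1-j}, ..., eps_{n-j}
    return u

def simulate_ma(alpha, n, sigma_eps=1.0, rng=None):
    """Simulate n observations from an MA(q) process with normal innovations."""
    rng = np.random.default_rng() if rng is None else rng
    eps = sigma_eps * rng.standard_normal(n + len(alpha))
    return ma_from_innovations(alpha, eps)
```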


Simulating data from an AR($p$) process is not quite so easy, because of the
initial observations. Recursive simulation can be used for all but the first $p$
observations, using the equation

$$u_t^* = \sum_{i=1}^{p}\rho_i u_{t-i}^* + \varepsilon_t^*. \tag{13.49}$$


For an AR(1) process, the first simulated observation $u_1^*$ can be drawn from
the stationary distribution of the process, by which we mean the unconditional
distribution of $u_t$. This distribution has mean zero and variance $\sigma_\varepsilon^2/(1-\rho_1^2)$.
The remaining observations are then generated recursively. When $p > 1$,
the first $p$ observations must be drawn from the stationary distribution of $p$
consecutive elements of the AR($p$) series. This distribution has mean vector
zero and covariance matrix $\Omega$ given by expression (13.33) with $n = p$. Once
the specific form of this covariance matrix has been determined, perhaps by
solving the Yule-Walker equations, and $\Omega$ has been evaluated for the specific
values of the $\rho_i$, a $p \times p$ lower-triangular matrix $A$ can be found such
that $AA^\top = \Omega$; see the discussion of the multivariate normal distribution in
Section 4.3. We then generate $\varepsilon_p$ as a $p$-vector of white noise innovations
and construct the $p$-vector $u_p^*$ of the first $p$ observations as $u_p^* = A\varepsilon_p$. The
remaining observations are then generated recursively.


Since it may take considerable effort to find $\Omega$, a simpler technique is often
used. One starts the recursion (13.49) for a large negative value of $t$ with
essentially arbitrary starting values, often zero. By making the starting value
of $t$ far enough in the past, the joint distribution of $u_1^*$ through $u_p^*$ can be
made arbitrarily close to the stationary distribution. The values of $u_t^*$ for
nonpositive $t$ are then discarded.
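
This simpler technique can be sketched as follows in Python. The function names, the default of 500 discarded observations, and the use of normal innovations are our assumptions; how long the discarded stretch needs to be depends on how close the process is to nonstationarity.

```python
import numpy as np

def ar_recursion(rho, eps, start):
    """Apply u_t = sum_i rho_i * u_{t-i} + eps_t forward from p starting values (oldest first)."""
    rho = np.asarray(rho, dtype=float)
    p = rho.size
    u = [float(v) for v in start]
    for e in eps:
        lags = u[-1 : -p - 1 : -1]            # u_{t-1}, ..., u_{t-p}
        u.append(float(np.dot(rho, lags)) + float(e))
    return np.array(u[p:])

def simulate_ar(rho, n, burn=500, sigma_eps=1.0, rng=None):
    """Start the recursion at zero far in the past and discard the first `burn` values."""
    rng = np.random.default_rng() if rng is None else rng
    eps = sigma_eps * rng.standard_normal(burn + n)
    u = ar_recursion(rho, eps, np.zeros(len(rho)))
    return u[burn:]
```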


Starting the recursion far in the past also works with an ARMA($p$, $q$) model.
However, at least for simple models, we can exploit the covariances computed
by the extension of the Yule-Walker method discussed in Section 13.2. The
process (13.22) can be written explicitly as

$$u_t^* = \sum_{i=1}^{p}\rho_i u_{t-i}^* + \varepsilon_t^* + \sum_{j=1}^{q}\alpha_j\varepsilon_{t-j}^*. \tag{13.50}$$

In order to be able to compute the $u_t^*$ recursively, we need starting values for
$u_1^*, \ldots, u_p^*$ and $\varepsilon_{p-q+1}^*, \ldots, \varepsilon_p^*$. Given these, we can compute $u_{p+1}^*$ by drawing
the innovation $\varepsilon_{p+1}^*$ and using equation (13.50) for $t = p+1, \ldots, n$. The
starting values can be drawn from the joint stationary distribution characterized
by the autocovariances $v_i$ and covariances $w_j$ discussed in the previous
section. In Exercise 13.12, readers are asked to find this distribution for the
relatively simple ARMA(1, 1) case.


13.4 Single-Equation Dynamic Models 


Economists often wish to model the relationship between the current value
of a dependent variable $y_t$, the current and lagged values of one or more
independent variables, and, quite possibly, lagged values of $y_t$ itself. This sort
of model can be motivated in many ways. Perhaps it takes time for economic
agents to perceive that the independent variables have changed, or perhaps it
is costly for them to adjust their behavior. In this section, we briefly discuss
a number of models of this type. For notational simplicity, we assume that
there is only one independent variable, denoted $x_t$. In practice, of course,
there is usually more than one such variable, but it will be obvious how to
extend the models we discuss to handle this more general case.


Distributed Lag Models 


When a dependent variable depends on current and lagged values of $x_t$, but
not on lagged values of itself, we have what is called a distributed lag model.
When there is only one independent variable, plus a constant term, such a
model can be written as

$$y_t = \delta + \sum_{j=0}^{q}\beta_j x_{t-j} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2), \tag{13.51}$$

in which $y_t$ depends on the current value of $x_t$ and on $q$ lagged values. The
constant term $\delta$ and the coefficients $\beta_j$ are to be estimated.


In many cases, $x_t$ is positively correlated with some or all of the lagged values
$x_{t-j}$ for $j \ge 1$. In consequence, the OLS estimates of the $\beta_j$ in equation
(13.51) may be quite imprecise. However, this is generally not a problem if
we are merely interested in the long-run impact of changes in the independent
variable. This long-run impact is

$$\gamma \equiv \sum_{j=0}^{q}\beta_j = \sum_{j=0}^{q}\frac{\partial y_t}{\partial x_{t-j}}. \tag{13.52}$$
j=0 j=0 


We can estimate (13.51) and then calculate the estimate $\hat\gamma$ using (13.52), or
we can obtain $\hat\gamma$ directly by reparametrizing regression (13.51) as

$$y_t = \delta + \gamma x_t + \sum_{j=1}^{q}\beta_j(x_{t-j} - x_t) + u_t. \tag{13.53}$$

The advantage of this reparametrization is that the standard error of $\hat\gamma$ is
immediately available from the regression output.


In Section 3.4, we derived an expression for the variance of a weighted sum
of parameter estimates. Expression (3.33), which can be written in a more
intuitive fashion as (3.68), can be applied directly to $\hat\gamma$, which is an unweighted
sum. If we do so, we find that

$$\mathrm{Var}(\hat\gamma) = \iota^\top\mathrm{Var}(\hat\beta)\,\iota = \sum_{j=0}^{q}\mathrm{Var}(\hat\beta_j) + 2\sum_{j=1}^{q}\sum_{k=0}^{j-1}\mathrm{Cov}(\hat\beta_j, \hat\beta_k), \tag{13.54}$$

where the smallest value of $j$ in the double summation is 1 rather than 0,
because no valid value of $k$ exists for $j = 0$. When $x_{t-j}$ is positively correlated
with $x_{t-k}$ for all $j \ne k$, the covariance terms in (13.54) are generally all
negative. When the correlations are large, these covariance terms can often
be large in absolute value, so much so that $\mathrm{Var}(\hat\gamma)$ may be smaller than the
variance of $\hat\beta_j$ for some or all $j$. If we are interested in the long-run impact of
$x_t$ on $y_t$, it is therefore perfectly sensible just to estimate equation (13.53).
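
The equivalence of the two routes to $\hat\gamma$ and its variance is easy to confirm numerically. The Python sketch below uses simulated data in which every number is arbitrary, and the function names are ours. It computes $\hat\gamma$ and $\mathrm{Var}(\hat\gamma)$ first by summing the estimates from the original regression and applying the formula $\iota^\top\mathrm{Var}(\hat\beta)\iota$, and then by reading them off the regression in which $x_t$ and the differences $x_{t-j} - x_t$ are the regressors.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and the usual covariance estimate s^2 (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, s2 * XtX_inv

def long_run_two_ways(y, lags):
    """lags[:, j] holds x_{t-j} for j = 0, ..., q. Returns (gamma_hat, variance) computed
    from the original regression and from the reparametrized regression."""
    const = np.ones((len(y), 1))
    b, V = ols(np.hstack([const, lags]), y)
    direct = (b[1:].sum(), V[1:, 1:].sum())          # iota' Var(beta_hat) iota
    diffs = lags[:, 1:] - lags[:, [0]]               # x_{t-j} - x_t
    b2, V2 = ols(np.hstack([const, lags[:, [0]], diffs]), y)
    return direct, (b2[1], V2[1, 1])

# Illustrative simulated data: x is a moving average of noise, hence positively autocorrelated
rng = np.random.default_rng(0)
n, q = 200, 3
x = np.convolve(rng.standard_normal(n + q + 3), np.ones(4) / 4, mode="valid")
lags = np.column_stack([x[q - j : q - j + n] for j in range(q + 1)])
y = 1.0 + lags @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.standard_normal(n)
direct, reparam = long_run_two_ways(y, lags)
```

Because the second regression is just a nonsingular linear transformation of the first, the two pairs agree up to rounding error.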


The Partial Adjustment Model


One popular alternative to distributed lag models like (13.51) is the partial
adjustment model, which dates back at least to Nerlove (1958). Suppose that
the desired level of an economic variable $y_t$ is $y_t^\circ$. This desired level is assumed
to depend on a vector of exogenous variables $X_t$ according to

$$y_t^\circ = X_t\beta^\circ + e_t, \quad e_t \sim \mathrm{IID}(0, \sigma_e^2). \tag{13.55}$$

Because of adjustment costs, $y_t$ is not equal to $y_t^\circ$ in every period. Instead, it
is assumed to adjust toward $y_t^\circ$ according to the equation

$$y_t - y_{t-1} = (1-\delta)(y_t^\circ - y_{t-1}) + v_t, \quad v_t \sim \mathrm{IID}(0, \sigma_v^2), \tag{13.56}$$

where $\delta$ is an adjustment parameter that is assumed to be positive and strictly
less than 1. Solving (13.55) and (13.56) for $y_t$, we find that

$$\begin{aligned}
y_t &= y_{t-1} - (1-\delta)y_{t-1} + (1-\delta)X_t\beta^\circ + (1-\delta)e_t + v_t\\
&= X_t\beta + \delta y_{t-1} + u_t,
\end{aligned} \tag{13.57}$$


where $\beta \equiv (1-\delta)\beta^\circ$ and $u_t \equiv (1-\delta)e_t + v_t$. Thus the partial adjustment
model leads to a linear regression of $y_t$ on $X_t$ and $y_{t-1}$. The coefficient of
$y_{t-1}$ is the adjustment parameter, and estimates of $\beta^\circ$ can be obtained from
the OLS estimates of $\beta$ and $\delta$. This model does not make sense if $\delta < 0$ or if
$\delta \ge 1$. Moreover, when $\delta$ is close to 1, the implied speed of adjustment may
be implausibly slow.
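
Recovering $\beta^\circ$ from the regression amounts to dividing the estimate of $\beta$ by one minus the estimate of $\delta$. A minimal Python sketch, with a function name of our own choosing and with standard errors (which would require the delta method) left aside, is:

```python
import numpy as np

def partial_adjustment_ols(y, X):
    """Regress y_t on X_t and y_{t-1}; return (beta_hat, delta_hat, implied beta-desired estimate)."""
    Z = np.column_stack([X[1:], y[:-1]])          # the first observation is lost to the lag
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    beta, delta = coef[:-1], coef[-1]
    return beta, delta, beta / (1 - delta)
```

The division is meaningful only when the estimate of $\delta$ is well below 1; as it approaches 1, the implied estimate of $\beta^\circ$ becomes very sensitive to small changes in the estimated adjustment parameter.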


Equation (13.57) can be solved for $y_t$ as a function of current and lagged
values of $X_t$ and $u_t$. Under the assumption that $|\delta| < 1$, we find that

$$y_t = \sum_{j=0}^{\infty}\delta^j X_{t-j}\beta + \sum_{j=0}^{\infty}\delta^j u_{t-j}.$$


Thus we see that the partial adjustment model implies a particular form of
distributed lag. However, in contrast to the model (13.51), $y_t$ now depends on
lagged values of the error terms $u_t$ as well as on lagged values of the exogenous
variables $X_t$. This makes sense in many cases. If the regressors affect $y_t$ via a
distributed lag, and if the error terms reflect the combined influence of other
regressors that have been omitted, then it is surely plausible that the omitted
regressors would also affect $y_t$ via a distributed lag. However, the restriction
that the same distributed lag coefficients should apply to all the regressors
and to the error terms may be excessively strong in many cases.


The partial adjustment model is only one of many economic models that can 
be used to justify the inclusion of one or more lags of the dependent variables 
in regression functions. Others are discussed in Dhrymes (1971) and Hendry, 
Pagan, and Sargan (1984). We now consider a general family of regression 
models that include lagged dependent and lagged independent variables. 


Autoregressive Distributed Lag Models


For simplicity of notation, we will continue to discuss only models with a
single independent variable, $x_t$. In this case, an autoregressive distributed
lag, or ADL, model can be written as

$$y_t = \beta_0 + \sum_{i=1}^{p}\beta_i y_{t-i} + \sum_{j=0}^{q}\gamma_j x_{t-j} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2). \tag{13.58}$$

Because there are $p$ lags on $y_t$ and $q$ lags on $x_t$, this is sometimes called an
ADL($p$, $q$) model.


A widely encountered special case of (13.58) is the ADL(1,1) model 


$$y_t = \beta_0 + \beta_1 y_{t-1} + \gamma_0 x_t + \gamma_1 x_{t-1} + u_t. \tag{13.59}$$


Because most results that are true for the ADL(1, 1) model are also true, with 
obvious modifications, for the more general ADL(p, q) model, we will largely 
confine our discussion to this special case. 


Although the ADL(1, 1) model is quite simple, many commonly encountered
models are special cases of it. When $\beta_1 = \gamma_1 = 0$, we have a static regression
model with IID errors; when $\gamma_0 = \gamma_1 = 0$, we have a univariate AR(1) model;
when $\gamma_1 = 0$, we have a partial adjustment model; when $\gamma_1 = -\beta_1\gamma_0$, we have
a static regression model with AR(1) errors; and when $\beta_1 = 1$ and $\gamma_1 = -\gamma_0$,
we have a model in first differences that can be written as

$$\Delta y_t = \beta_0 + \gamma_0\Delta x_t + u_t.$$


Before we accept any of these special cases, it makes sense to test them 
against (13.59). This can be done by means of asymptotic t or F tests, which 
it may be wise to bootstrap when the sample size is not large. 


It is usually desirable to impose the condition that $|\beta_1| < 1$ in (13.59). Strictly
speaking, this is not a stationarity condition, since we cannot expect $y_t$ to be
stationary without imposing further conditions on the explanatory variable $x_t$.
However, it is easy to see that, if this condition is violated, the dependent
variable $y_t$ exhibits explosive behavior. If the condition is satisfied, there may
exist a long-run equilibrium relationship between $y_t$ and $x_t$, which can be used
to develop a particularly interesting reparametrization of (13.59).


Suppose there exists an equilibrium value $x^\circ$ to which $x_t$ would converge as
$t \to \infty$ in the absence of shocks. Then, in the absence of the error terms $u_t$,
$y_t$ would converge to a steady-state long-run equilibrium value $y^\circ$ such that

$$y^\circ = \beta_0 + \beta_1 y^\circ + (\gamma_0 + \gamma_1)x^\circ.$$

Solving this equation for $y^\circ$ as a function of $x^\circ$ yields

$$y^\circ = \frac{\beta_0}{1-\beta_1} + \frac{\gamma_0+\gamma_1}{1-\beta_1}\,x^\circ = \frac{\beta_0}{1-\beta_1} + \lambda x^\circ, \tag{13.60}$$

where

$$\lambda \equiv \frac{\gamma_0+\gamma_1}{1-\beta_1}. \tag{13.61}$$

This is the long-run derivative of $y^\circ$ with respect to $x^\circ$, and it is an elasticity
if both series are in logarithms. An estimate of $\lambda$ can be computed directly
from the estimates of the parameters of (13.59). Note that the result (13.60)
and the definition (13.61) make sense only if the condition $|\beta_1| < 1$ is satisfied.


Because it is so general, the ADL(p, q) model is a good place to start when 
attempting to specify a dynamic regression model. In many cases, setting 
$p = q = 1$ will be sufficiently general, but with quarterly data it may be wise
to start with p = q = 4. Of course, we very often want to impose restrictions 
on such a model. Depending on how we write the model, different restrictions 
may naturally suggest themselves. These can be tested in the usual way by 
means of asymptotic F and t tests, which may be bootstrapped to improve 
their finite-sample properties. 


Error-Correction Models 


It is a straightforward exercise to check that the ADL(1, 1) model of equation
(13.59) can be rewritten as

$$\Delta y_t = \beta_0 + (\beta_1 - 1)(y_{t-1} - \lambda x_{t-1}) + \gamma_0\Delta x_t + u_t, \tag{13.62}$$

where $\lambda$ was defined in (13.61). Equation (13.62) is called an error-correction
model. It expresses the ADL(1, 1) model in terms of an error-correction
mechanism; both the model and mechanism are often abbreviated to ECM.³
Although the model (13.62) appears to be nonlinear, it is really just a reparametrization
of the linear model (13.59). If the latter is estimated by OLS, an
appropriate GNR can be used to obtain the covariance matrix of the estimates
of the parameters of (13.62). Alternatively, any good NLS package should do
this for us if we start it at the OLS estimates.
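
That the error-correction form is only a reparametrization can be confirmed with a few lines of Python. The function names and the dictionary keys are ours; the sketch maps the four ADL(1, 1) parameters to the ECM parameters and evaluates the systematic part of $\Delta y_t$ both ways.

```python
def adl_to_ecm(beta0, beta1, gamma0, gamma1):
    """Map ADL(1,1) parameters to the parameters of the error-correction form."""
    if abs(beta1) >= 1:
        raise ValueError("lambda is defined only when |beta1| < 1")
    lam = (gamma0 + gamma1) / (1 - beta1)          # long-run multiplier
    return {"beta0": beta0, "adjustment": beta1 - 1, "lambda": lam, "gamma0": gamma0}

def adl_change(beta0, beta1, gamma0, gamma1, y_lag, x, x_lag):
    """Systematic part of y_t - y_{t-1} implied by the ADL(1,1) model."""
    return beta0 + beta1 * y_lag + gamma0 * x + gamma1 * x_lag - y_lag

def ecm_change(ecm, y_lag, x, x_lag):
    """Systematic part of the change in y_t implied by the error-correction form."""
    return (ecm["beta0"]
            + ecm["adjustment"] * (y_lag - ecm["lambda"] * x_lag)
            + ecm["gamma0"] * (x - x_lag))
```

With $\beta_0 = 1$, $\beta_1 = 0.7$, $\gamma_0 = 0.5$, and $\gamma_1 = 0.1$, for instance, $\lambda = 2$, and both functions give the same change in $y_t$ for any values of $y_{t-1}$, $x_t$, and $x_{t-1}$.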


The difference between $y_{t-1}$ and $\lambda x_{t-1}$ in the ECM (13.62) measures the
extent to which the long-run equilibrium relationship between $x_t$ and $y_t$ is
not satisfied. Consequently, the parameter $\beta_1 - 1$ can be interpreted as the
proportion of the resulting disequilibrium that is reflected in the movement of
$y_t$ in one period. In this respect, $\beta_1 - 1$ is essentially the same as the parameter
$\delta - 1$ of the partial adjustment model. The term $(\beta_1 - 1)(y_{t-1} - \lambda x_{t-1})$
that appears in (13.62) is the error-correction term. Of course, many ADL
models in addition to the ADL(1, 1) model can be rewritten as error-correction
models. An important feature of error-correction models is that they can also
be used with nonstationary data, as we will discuss in Chapter 14.


3 Error-correction models were first used by Hendry and Anderson (1977) and 
Davidson, Hendry, Srba, and Yeo (1978). See Banerjee, Dolado, Galbraith, 
and Hendry (1993) for a detailed treatment. 


13.5 Seasonality


As we observed in Section 2.5, many economic time series display a regular 
pattern of seasonal variation over the course of every year. Seasonality, as 
such a pattern is called, may be caused by seasonal variation in the weather 
or by the timing of statutory holidays, school vacation periods, and so on. 
Many time series that are observed quarterly, monthly, weekly, or daily display 
some form of seasonality, and this can have important implications for applied 
econometric work. Failing to account properly for seasonality can easily cause 
us to make incorrect inferences, especially in dynamic models. 


There are two different ways to deal with seasonality in economic data. One 
approach is to try to model it explicitly. We might, for example, attempt 
to explain the seasonal variation in a dependent variable by the seasonal 
variation in some of the independent variables, perhaps including weather 
variables or, more commonly, seasonal dummy variables, which were discussed 
in Section 2.5. Alternatively, we can model the error terms as following a 
seasonal ARMA process, or we can explicitly estimate a seasonal ADL model. 


The second way to deal with seasonality is usually less satisfactory. It depends 
on the use of seasonally adjusted data, that is, data which have been massaged 
in such a way that they represent what the series would supposedly have been 
in the absence of seasonal variation. Indeed, many statistical agencies release 
only seasonally adjusted data for many time series, and economists often treat 
these data as if they were genuine. However, as we will see later in this section, 
using seasonally adjusted data can have unfortunate consequences. 


Seasonal ARMA Processes 


One way to deal with seasonality is to model the error terms of a regression 
model as following a seasonal ARMA process, that is, an ARMA process with 
nonzero coefficients only, or principally, at seasonal lags. In practice, purely 
autoregressive processes, with no moving average component, are generally 
used. The simplest and most commonly encountered example is the simple 
AR(4) process 

$$u_t = \rho_4 u_{t-4} + \varepsilon_t, \tag{13.63}$$

where $\rho_4$ is a parameter to be estimated, and, as usual, $\varepsilon_t$ is white noise.
Of course, this process makes sense only for quarterly data. Another purely
seasonal AR process for quarterly data is the restricted AR(8) process

$$u_t = \rho_4 u_{t-4} + \rho_8 u_{t-8} + \varepsilon_t, \tag{13.64}$$


which is analogous to an AR(2) process for nonseasonal data. 


In many cases, error terms may exhibit both seasonal and nonseasonal serial 
correlation. This suggests combining a purely seasonal with a nonseasonal 
process. Suppose, for example, that we wish to combine an AR(1) process and 
a simple AR(4) process. The most natural approach is probably to combine
them multiplicatively. Using lag-operator notation, we obtain

$$(1 - \rho_1 L)(1 - \rho_4 L^4)u_t = \varepsilon_t.$$

This can be rewritten as

$$u_t = \rho_1 u_{t-1} + \rho_4 u_{t-4} - \rho_1\rho_4 u_{t-5} + \varepsilon_t. \tag{13.65}$$

Notice that the coefficient of $u_{t-5}$ in equation (13.65) is equal to the negative
of the product of the coefficients of $u_{t-1}$ and $u_{t-4}$. This restriction can easily
be tested. If it does not hold, then we should presumably consider more
general ARMA processes with some coefficients at seasonal lags.
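
Multiplying the two lag polynomials is a convolution of their coefficient vectors, which makes the restriction easy to see numerically. In the Python sketch below, the function name is ours, and coefficients are stored in ascending powers of $L$.

```python
import numpy as np

def multiplicative_seasonal_ar(rho1, rho4):
    """Coefficients on u_{t-1}, ..., u_{t-5} implied by (1 - rho1 L)(1 - rho4 L^4) u_t = eps_t."""
    nonseasonal = np.array([1.0, -rho1])                  # 1 - rho1 L
    seasonal = np.array([1.0, 0.0, 0.0, 0.0, -rho4])      # 1 - rho4 L^4
    poly = np.convolve(nonseasonal, seasonal)             # product polynomial in L
    return -poly[1:]                                      # move the lags to the right-hand side

coefs = multiplicative_seasonal_ar(0.5, 0.8)
# The fifth coefficient equals minus the product of the first and fourth.
```

An unrestricted alternative would estimate the coefficients on $u_{t-1}$, $u_{t-4}$, and $u_{t-5}$ freely, and the restriction could then be tested in the usual way.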


If adequate account of seasonality is not taken, there is often evidence of 
fourth-order serial correlation in a regression model. Thus testing for it often 
provides a useful diagnostic test. Moreover, seasonal autoregressive processes 
provide a parsimonious way to model seasonal variation that is not explained 
by the regressors. The simple AR(4) process (13.63) uses only one extra para- 
meter, and the restricted AR(8) process (13.64) uses only two. However, just 
as evidence of first-order serial correlation does not mean that the error terms 
really follow an AR(1) process, evidence of fourth-order serial correlation does 
not mean that they really follow an AR(4) process. 


By themselves, seasonal ARMA processes cannot capture one important fea- 
ture of seasonality, namely, the fact that different seasons of the year have 
different characteristics: Summer is not just winter with a different label. 
However, an ARMA process makes no distinction among the dynamical pro- 
cesses associated with the different seasons. One simple way to alleviate this 
problem would be to use seasonal dummy variables as well as a seasonal 
ARMA process. Another potential difficulty is that the seasonal variation of 
many time series is not stationary, in which case a stationary ARMA process 
cannot adequately account for it. Trending seasonal variables may help to 
cope with nonstationary seasonality, as we will discuss shortly in the context 
of a specific example. 


Seasonal ADL Models 


Suppose we start with a static regression model in which $y_t$ equals $X_t\beta + u_t$
and then add three quarterly dummy variables, $s_{t1}$ through $s_{t3}$, assuming
that there is a constant among the other explanatory variables. The dummies
may be ordinary quarterly dummies, or else the modified dummies, defined
in equations (2.50), that sum to zero over each year. We then allow the error
term $u_t$ to follow the simple AR(4) process (13.63). Solving for $u_{t-4}$ yields
the nonlinear regression model

$$y_t = \rho_4 y_{t-4} + X_t\beta - \rho_4 X_{t-4}\beta + \sum_{j=1}^{3}\delta_j s_{tj} + \varepsilon_t. \tag{13.66}$$

There are no lagged seasonal dummies in this model because they would be
collinear with the existing regressors.


Equation (13.66) is a special case of the seasonal ADL model 


$$y_t = \gamma_4 y_{t-4} + X_t\beta_1 + X_{t-4}\beta_2 + \sum_{j=1}^{3} \delta_j s_{tj} + \varepsilon_t, \tag{13.67}$$


which is just a linear regression model in which $y_t$ depends on $y_{t-4}$, the three
seasonal dummies, $X_t$, and $X_{t-4}$. Before accepting the model (13.66), one
would always want to test the common factor restrictions that it imposes on
(13.67); this can readily be done by using asymptotic F tests, as discussed in
Section 7.9. One would almost certainly also want to estimate ADL models
both more and less general than (13.67), especially if the common factor
restrictions are rejected. For example, it would not be surprising if $y_{t-1}$ and
at least some components of $X_{t-1}$ also belonged in the model, but it would
also not be surprising if some components of $X_{t-4}$ did not belong.


Seasonally Adjusted Data 


Instead of attempting to model seasonality, many economists prefer to avoid 
dealing with it entirely by using seasonally adjusted data. Although the idea 
of seasonally adjusting a time series is intuitively appealing, it is very hard to 
do so in practice without resorting to highly unrealistic assumptions. Seasonal 
adjustment of a series $y_t$ makes sense if, for all $t$, we can write $y_t = y_t^\circ + y_t^s$,
where $y_t^\circ$ is a time series that contains no seasonal variation at all, and $y_t^s$ is
a time series that contains nothing but seasonal variation. However, this is 
surely an extreme assumption, which would be false in almost any economic 
model of seasonal variation that could reasonably be imagined. 


To make the discussion more concrete, consider Figure 13.2, which shows the 
logarithm of urban housing starts in Canada, quarterly, for the period 1966 to 
2001. The solid line represents the actual data, and the dotted line represents 
a seasonally adjusted series.⁴ It is clear from the figure that housing starts
in Canada are highly seasonal, with the first (winter) quarter usually having 
a much smaller number of starts than the other three quarters. There is 
also some indication that the magnitude of the seasonal variation may have 
become smaller in the latter part of the sample, perhaps because of changes 
in construction technology. 


4 These data come from Statistics Canada. The actual data, which start in 1948,
are from CANSIM series J6001, and the adjusted data, which start in 1966,
are from CANSIM series J9001.


Figure 13.2 Urban housing starts in Canada, 1966-2001


Seasonal Adjustment by Regression


In Section 2.5, we discussed the use of seasonal dummy variables to construct
seasonally adjusted data by regression. Although this approach is easy to
implement and easy to analyze, it has a number of disadvantages, and it is
almost never used by official statistical agencies.


One problem with the simplest form of seasonal adjustment by regression is 
that it does not allow the pattern of seasonality to change over time. However, 
as Figure 13.2 illustrates, seasonal patterns often seem to do precisely that. A 
natural way to model this is to add additional seasonal dummy variables that 
have been interacted with powers of a time trend that increases annually. In 
the case of quarterly data, such a trend would be 


$$t^q \equiv [\,1\ \ 1\ \ 1\ \ 1\ \ 2\ \ 2\ \ 2\ \ 2\ \ 3\ \ 3\ \ 3\ \ 3\ \ \cdots\,]. \tag{13.68}$$


The reason $t^q$ takes this rather odd form is that, when it is multiplied by the
seasonal dummies, the resulting trending dummies always sum to zero over 
each year. If one simply multiplied seasonal dummies by an ordinary time 
trend, that would not be the case. 
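
The point about yearly sums can be checked numerically. The following is a minimal Python sketch, not part of the original exposition. It assumes that a modified dummy takes the value 1 in its own quarter, -1 in the fourth quarter, and 0 otherwise (equations (2.50) are not reproduced in this section), and that the sample starts in a first quarter. The function names are hypothetical.

```python
def annual_trend(n_obs, periods_per_year=4):
    """Trend that is constant within each year and rises by 1 every year."""
    return [t // periods_per_year + 1 for t in range(n_obs)]


def modified_dummy(n_obs, season, periods_per_year=4):
    """Assumed form of a modified dummy: 1 in `season`, -1 in the last
    season of the year, 0 otherwise, so that it sums to zero over each year."""
    values = []
    for t in range(n_obs):
        s = t % periods_per_year
        if s == season:
            values.append(1.0)
        elif s == periods_per_year - 1:
            values.append(-1.0)
        else:
            values.append(0.0)
    return values


def yearly_sums(x, periods_per_year=4):
    """Sum a series over each successive block of one year."""
    return [sum(x[i:i + periods_per_year])
            for i in range(0, len(x), periods_per_year)]


n_obs = 12                                   # three full years of quarterly data
s1 = modified_dummy(n_obs, season=0)         # first-quarter dummy
stepped = annual_trend(n_obs)                # 1,1,1,1,2,2,2,2,3,3,3,3
ordinary = list(range(1, n_obs + 1))         # 1,2,3,...,12

trending_dummy = [a * b for a, b in zip(stepped, s1)]
naive_dummy = [a * b for a, b in zip(ordinary, s1)]

print(yearly_sums(trending_dummy))   # [0.0, 0.0, 0.0]
print(yearly_sums(naive_dummy))      # [-3.0, -3.0, -3.0]
```

With the stepped trend, each trending dummy still sums to zero within every year. With the ordinary trend, the yearly sums equal -3 in this example.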


Let S denote a matrix of seasonal dummies and seasonal dummies that have 
been interacted with powers of $t^q$, or, in the case of data at other than quarterly
frequencies, whatever annually increasing trend term is appropriate. In the 
case of quarterly data, S would normally have 3, 6, 9, or maybe 12 columns. 
In the case of monthly data, it would normally have 11, 22, or 33 columns. In 
all cases, every one of the variables in S should sum to zero over each year.
Then, if y denotes the vector of observations on a series to be seasonally
adjusted, we could run the regression 


$$y = \beta_0\iota + S\delta + u \tag{13.69}$$


and estimate the seasonally adjusted series as $y^* = y - S\hat\delta$. Unfortunately,
although equations like (13.69) often provide a reasonable approximation to
observed seasonal patterns, they frequently fail to do so, as readers will find
when they answer Exercise 13.17.
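
As a rough illustration of seasonal adjustment by regression, the sketch below regresses a hypothetical quarterly series on a constant and three modified dummies, then subtracts the fitted seasonal component. It is only a sketch under stated assumptions: the dummies take the assumed form 1, 0, 0, -1 within each year, the sample starts in a first quarter, no trending dummies are included, and the helper names (`ols`, `seasonal_dummies`, `adjust_by_regression`) are invented for this example.

```python
def ols(X, y):
    """OLS coefficients from the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yt for row, yt in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting, then eliminate this column from every other row.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def seasonal_dummies(n_obs):
    """Three modified quarterly dummies (assumed form: 1 in own quarter,
    -1 in the fourth quarter, 0 otherwise); the sample starts in quarter 1."""
    return [[1.0 if t % 4 == j else (-1.0 if t % 4 == 3 else 0.0)
             for j in range(3)] for t in range(n_obs)]


def adjust_by_regression(y):
    """Regress y on a constant and S, then return y - S * delta_hat."""
    S = seasonal_dummies(len(y))
    X = [[1.0] + row for row in S]
    delta_hat = ols(X, y)[1:]          # drop the coefficient on the constant
    return [yt - sum(d * s for d, s in zip(delta_hat, row))
            for yt, row in zip(y, S)]


# Hypothetical series: a constant plus a fixed quarterly pattern.
pattern = [-2.0, 0.5, 1.0, 0.5]
y = [10.0 + pattern[t % 4] for t in range(40)]
y_star = adjust_by_regression(y)
print(round(min(y_star), 6), round(max(y_star), 6))
```

Because the seasonal pattern in this artificial series is constant over time, the regression removes it completely. With real data, where the pattern may drift, the adjustment would be only approximate.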


Another problem with using seasonal dummies is that, as additional obser- 
vations become available, the estimates from the dummy variable regression 
will not stay the same. It is inevitable that, as the sample size increases, the 
estimates of $\delta$ in equation (13.69) will change, and so every element of $y^*$ will
change every time a new observation becomes available. This is clearly a most 
undesirable feature from the point of view of users of official statistics. More- 
over, as the sample size gets larger, the number of trend terms may need to 
increase if a polynomial is to continue to provide an adequate approximation 
to changes in the pattern of seasonal variation. 


Seasonal Adjustment and Linear Filters 


The seasonal adjustment procedures that are actually used by statistical agen- 
cies tend to be very complicated. They attempt to deal with a host of practical 
problems, including changes in seasonal patterns over time, variations in the 
number of shopping days and the dates of holidays from year to year, and the 
fact that pre-sample and post-sample observations are not available. We will 
not attempt to discuss these methods at all. 


Although official methods of seasonal adjustment are very complicated, they 
can often be approximated remarkably well by much simpler procedures based 
on what are called linear filters. Let y be an n-vector of observations (often 
in logarithms rather than levels) on a series that has not been seasonally 
adjusted. Then a linear filter consists of an $n \times n$ matrix $\Phi$, with rows that
sum to 1, such that the seasonally adjusted series $y^*$ is equal to $\Phi y$. Each row
of the matrix $\Phi$ consists of a vector of filter weights. Thus each element $y_t^*$
of the seasonally adjusted series is equal to a weighted average of current,
leading, and lagged values of $y_t$.


Let us consider a simple example for quarterly data. Suppose we first create 
three-term and eleven-term moving averages 


$$\bar y_t = \frac{1}{3}\,(y_{t-4} + y_t + y_{t+4}) \quad\text{and}\quad \tilde y_t = \frac{1}{11} \sum_{j=-5}^{5} y_{t+j}.$$


The difference between $\bar y_t$ and $\tilde y_t$ is a rolling estimate of the amount by which
the value of $y_t$ for the current quarter tends to differ from its average value
over the year. Thus one way to define a seasonally adjusted series would be

$$\begin{aligned}
y_t^* &= y_t - \bar y_t + \tilde y_t \\
&= .0909\,y_{t-5} - .2424\,y_{t-4} + .0909\,y_{t-3} + .0909\,y_{t-2} \\
&\quad + .0909\,y_{t-1} + .7576\,y_t + .0909\,y_{t+1} + .0909\,y_{t+2} \\
&\quad + .0909\,y_{t+3} - .2424\,y_{t+4} + .0909\,y_{t+5}.
\end{aligned} \tag{13.70}$$


This example corresponds to a linear filter in which, for $5 < p < n - 5$, the $p$th
row of $\Phi$ would consist first of $p - 6$ zeros, followed by the eleven coefficients
that appear in (13.70), followed by $n - p - 5$ more zeros.
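
The filter weights in (13.70) follow mechanically from the two moving averages, as the short Python sketch below illustrates. The function names are hypothetical, and the sketch builds only interior rows of the filter matrix, for which all eleven leads and lags are available.

```python
def filter_weights():
    """Weights on y_{t-5}, ..., y_{t+5} implied by y*_t = y_t - ybar_t + ytilde_t."""
    w = {j: 0.0 for j in range(-5, 6)}
    w[0] += 1.0                       # the raw observation y_t
    for j in (-4, 0, 4):              # minus the three-term average ybar_t
        w[j] -= 1.0 / 3.0
    for j in range(-5, 6):            # plus the eleven-term average ytilde_t
        w[j] += 1.0 / 11.0
    return [w[j] for j in range(-5, 6)]


def filter_row(p, n):
    """Row p (counting from 1) of the n x n filter matrix, for 5 < p < n - 5."""
    return [0.0] * (p - 6) + filter_weights() + [0.0] * (n - p - 5)


weights = filter_weights()
print([round(w, 4) for w in weights])
# [0.0909, -0.2424, 0.0909, 0.0909, 0.0909, 0.7576, 0.0909, 0.0909, 0.0909, -0.2424, 0.0909]

row = filter_row(p=10, n=30)
print(len(row), round(sum(row), 12))   # 30 1.0
```

The weights sum to 1, as required of every row of the filter matrix, and the weight on the current observation is $1 - 1/3 + 1/11 \approx .7576$.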


Although this example is very simple, the basic approach that it illustrates 
may be found, in various modified forms, in almost all official seasonal adjust- 
ment procedures. The latter generally do not actually employ linear filters, 
but they do employ a number of moving averages in a way similar to the ex- 
ample. These moving averages tend to be longer than the ones in the example, 
and they often give progressively less weight to observations farther from t. 
An important feature of almost all seasonally adjusted data is that, as in the 
example, the weight given to $y_t$ is generally well below 1. For more on the
relationship between official procedures and ones based on linear filters, see 
Burridge and Wallis (1984) and Ghysels and Perron (1993). 


We have claimed that official seasonal adjustment procedures in most cases 
have much the same properties as linear filters applied to either the levels or 
the logarithms of the raw data. This assertion can be checked empirically 
by regressing a seasonally adjusted series on a number of leads and lags of 
the corresponding seasonally unadjusted series. If the assertion is accurate, 
such a regression should fit well, and the coefficients should have a distinctive 
pattern. The coefficient of the current value of the raw series should be fairly 
large but less than 1, the coefficients of seasonal lags and leads should be 
negative, and the coefficients of other lags and leads should be small and 
positive. In other words, the coefficients should resemble those in equation 
(13.70). In Exercise 13.17, readers are asked to see whether a linear filter 
provides a good approximation to the method actually used for seasonally 
adjusting the housing starts data. 


Consequences of Using Seasonally Adjusted Data 


The consequences of using seasonally adjusted data depend on how the data 
were actually generated and the nature of the procedures used for seasonal 
adjustment. For simplicity, we will suppose that 


$$y = y^\circ + y^s \quad\text{and}\quad X = X^\circ + X^s,$$

where $y^s$ and $X^s$ contain all the seasonal variation in $y$ and $X$, respectively,
and $y^\circ$ and $X^\circ$ contain all other economically interesting variation. Suppose
further that the DGP is

$$y^\circ = X^\circ\beta_0 + u, \qquad u \sim \mathrm{IID}(0, \sigma^2 \mathbf{I}). \tag{13.71}$$


Thus the economic relationship in which we are interested involves only the 
nonseasonal components of the data. 


If the same linear filter is applied to every series, the seasonally adjusted data
are $\Phi y$ and $\Phi X$, and the OLS estimator using those data is

$$\hat\beta_S = (X^\top \Phi^\top \Phi X)^{-1} X^\top \Phi^\top \Phi y. \tag{13.72}$$

This looks very much like a GLS estimator, with the matrix $\Phi^\top\Phi$ playing the
role of the inverse covariance matrix.


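The properties of the estimator $\hat\beta_S$ defined in equation (13.72) depend on how
the filter weights are chosen. Ideally, the filter would completely eliminate
seasonality, so that

$$\Phi y = \Phi y^\circ \quad\text{and}\quad \Phi X = \Phi X^\circ.$$

In this ideal case, we see that

$$\begin{aligned}
\hat\beta_S &= (X^{\circ\top} \Phi^\top \Phi X^\circ)^{-1} X^{\circ\top} \Phi^\top \Phi y^\circ \\
&= \beta_0 + (X^{\circ\top} \Phi^\top \Phi X^\circ)^{-1} X^{\circ\top} \Phi^\top \Phi u.
\end{aligned} \tag{13.73}$$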


If every column of X is exogenous, and not merely predetermined, it is clear 
that the second term in the last line here has expectation zero, which implies 
that $E(\hat\beta_S) = \beta_0$. Thus we see that, under the exogeneity assumption, the
OLS estimator that uses seasonally adjusted data is unbiased. But this is a 
very strong assumption for time-series data. 


Moreover, this estimator is not efficient. If the elements of u are actually 
homoskedastic and serially independent, as we assumed in (13.71), then the 
Gauss-Markov Theorem implies that the efficient estimator would be obtained 
by an OLS regression of $y^\circ$ on $X^\circ$. Instead, $\hat\beta_S$ is equivalent to the estimator
from a certain GLS regression of $y^\circ$ on $X^\circ$. Of course, the efficient estimator
is not feasible here, because we do not observe $y^\circ$ and $X^\circ$.


In many cases, we can prove consistency under much weaker assumptions than 
are needed to prove unbiasedness; see Sections 3.2 and 3.3. In particular, for 
OLS to be consistent, we usually just need the regressors to be predetermined. 
However, in the case of data that have been seasonally adjusted by means 
of a linear filter, this assumption is not sufficient. In fact, the exogeneity 
assumption that is needed in order to prove that $\hat\beta_S$ is unbiased is also needed
in order to prove that it is consistent. From (13.73) it follows that

$$\operatorname*{plim}_{n\to\infty} \hat\beta_S = \beta_0 + \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} X^{\circ\top} \Phi^\top \Phi X^\circ\Bigr)^{-1} \operatorname*{plim}_{n\to\infty} \frac{1}{n} X^{\circ\top} \Phi^\top \Phi u,$$

provided we impose sufficient conditions for the probability limits to exist and
be nonstochastic. The predeterminedness assumption (3.10) evidently does
not allow us to claim that the second probability limit here is a zero vector.
On the contrary, any correlation between error terms and regressors at leads
and lags that are given nonzero weights by the filter generally causes it to be
a nonzero vector. Therefore, the estimator $\hat\beta_S$ is inconsistent if the regressors
are merely predetermined.


Although the exogeneity assumption is always dubious in the case of time- 
series data, it is certainly false when the regressors include one or more lags 
of the dependent variable. There has been some work on the consequences 
of using seasonally adjusted data in this case; see Jaeger and Kunst (1990), 
Ghysels (1990), and Ghysels and Perron (1993), among others. It appears 
that, in models with a single lag of the dependent variable, estimates of the 
coefficient of the lagged dependent variable can be severely biased when sea- 
sonally adjusted data are used. This bias does not vanish as the sample size 
increases, and its magnitude can be substantial; see Davidson and MacKinnon 
(1993, Chapter 19) for an illustration. 


Seasonally adjusted data are very commonly used in applied econometric 
work. Indeed, it is difficult to avoid doing so in many cases, either because 
the actual data are not available or because it is the seasonally adjusted series 
that are really of interest. However, the results we have just discussed suggest 
that, especially for dynamic models, the undesirable consequences of using 
seasonally adjusted data may be quite severe. 


13.6 Autoregressive Conditional Heteroskedasticity 


With time-series data, it is not uncommon for least squares residuals to be 
quite small in absolute value for a number of successive periods of time, then 
much larger for a while, then smaller again, and so on. This phenomenon of 
time-varying volatility is often encountered in models for stock returns, foreign 
exchange rates, and other series that are determined in financial markets. 
Numerous models for dealing with this phenomenon have been proposed. One 
very popular approach is based on the concept of autoregressive conditional 
heteroskedasticity, or ARCH, that was introduced by Engle (1982). The basic 
idea of ARCH models is that the variance of the error term at time t depends 
on the realized values of the squared error terms in previous time periods. 


If $u_t$ denotes the error term adhering to a regression model, which may be
linear or nonlinear, and $\Omega_{t-1}$ denotes an information set that consists of data
observed through period $t - 1$, then what is called an ARCH($q$) process can
be written as

$$u_t = \sigma_t \varepsilon_t; \qquad \sigma_t^2 \equiv E(u_t^2 \mid \Omega_{t-1}) = \alpha_0 + \sum_{i=1}^{q} \alpha_i u_{t-i}^2, \tag{13.74}$$

where $\alpha_i > 0$ for $i = 0, 1, \ldots, q$, and $\varepsilon_t$ is white noise with variance 1. Here and
throughout this section, $\sigma_t$ is understood to be the positive square root of $\sigma_t^2$.
The skedastic function for the ARCH($q$) process is the rightmost expression
in (13.74). Since this function depends on $t$, the model is, as its name claims,
heteroskedastic. The term “conditional” is due to the fact that, unlike the
skedastic functions we have so far encountered, the ARCH skedastic function
is not exogenous, but merely predetermined. Thus the model prescribes the
variance of $u_t$ conditional on the past of the process.


Because the conditional variance of $u_t$ is a function of $u_{t-1}$, it is clear that $u_t$
and $u_{t-1}$ are not independent. They are, however, uncorrelated:

$$E(u_t u_{t-1}) = E\bigl(E(u_t u_{t-1} \mid \Omega_{t-1})\bigr) = E\bigl(u_{t-1} \sigma_t E(\varepsilon_t \mid \Omega_{t-1})\bigr) = 0,$$

where we have used the facts that $\sigma_t \in \Omega_{t-1}$ and that $\varepsilon_t$ is an innovation.
Almost identical reasoning shows that $E(u_t u_s) = 0$ for all $s < t$. Thus the
ARCH process involves only heteroskedasticity, not serial correlation.


If an ARCH($q$) process is covariance stationary, then $\sigma^2$, the unconditional
expectation of $u_t^2$, exists and is independent of $t$. Under the stationarity
assumption, we may take the unconditional expectation of the second equation
of (13.74), from which we find that

$$\sigma^2 = \alpha_0 + \sigma^2 \sum_{i=1}^{q} \alpha_i.$$

Therefore,

$$\sigma^2 = \frac{\alpha_0}{1 - \sum_{i=1}^{q} \alpha_i}. \tag{13.75}$$

The condition $\sum_{i=1}^{q} \alpha_i < 1$ is required for $\sigma^2$ to be positive, and so it is
also a necessary condition for stationarity. It is of course necessary that the
conditional variances $\sigma_t^2$ should be positive, and that is why we require that
$\alpha_i > 0$ for all $i$. If that requirement were not satisfied, realizations of some of
the $\sigma_t^2$ could be negative.
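
These properties of an ARCH process can be seen in a small simulation. The following Python sketch, which is not part of the original text, generates an ARCH(1) series with normal innovations and hypothetical parameter values $\alpha_0 = 1$ and $\alpha_1 = 0.3$. It then computes the first-order sample autocorrelations of the levels and of the squares, together with the sample second moment. Starting the recursion at the unconditional variance and discarding an initial stretch of draws are assumptions made here for convenience.

```python
import random


def simulate_arch1(n, alpha0, alpha1, seed=0, burn_in=500):
    """Simulate u_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1 * u_{t-1}^2."""
    rng = random.Random(seed)
    u_sq_lag = alpha0 / (1.0 - alpha1)     # start at the unconditional variance
    u = []
    for t in range(n + burn_in):
        sigma2 = alpha0 + alpha1 * u_sq_lag
        u_t = sigma2 ** 0.5 * rng.gauss(0.0, 1.0)
        u_sq_lag = u_t * u_t
        if t >= burn_in:                   # discard the start-up draws
            u.append(u_t)
    return u


def first_autocorrelation(x):
    """Sample first-order autocorrelation of a series."""
    mean = sum(x) / len(x)
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, len(x)))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


u = simulate_arch1(20000, alpha0=1.0, alpha1=0.3)
rho_levels = first_autocorrelation(u)
rho_squares = first_autocorrelation([v * v for v in u])
sample_variance = sum(v * v for v in u) / len(u)

print(round(rho_levels, 3), round(rho_squares, 3), round(sample_variance, 3))
```

In a long sample, the autocorrelation of the levels should be close to zero, that of the squares should be clearly positive, and the sample second moment should be near $\alpha_0/(1 - \alpha_1) \approx 1.43$.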


Unfortunately, the ARCH(q) process has not proven to be very satisfactory in 
applied work. Many financial time series display time-varying volatility that 
is highly persistent, but the correlation between successive values of $u_t^2$ is not
very high; see Pagan (1996). In order to accommodate these two empirical
regularities, $q$ must be large. But if $q$ is large, the ARCH($q$) process has a
lot of parameters to estimate, and the requirement that all the $\alpha_i$ should be
positive may not be satisfied if it is not explicitly imposed. 


GARCH Models 


The generalized ARCH model, which was proposed by Bollerslev (1986), is
much more widely used than the original ARCH model. We may write a
GARCH($p$, $q$) process as

$$u_t = \sigma_t \varepsilon_t; \qquad \sigma_t^2 \equiv E(u_t^2 \mid \Omega_{t-1}) = \alpha_0 + \sum_{i=1}^{q} \alpha_i u_{t-i}^2 + \sum_{j=1}^{p} \delta_j \sigma_{t-j}^2. \tag{13.76}$$


The conditional variance here can be written more compactly as

$$\sigma_t^2 = \alpha_0 + \alpha(L) u_t^2 + \delta(L) \sigma_t^2, \tag{13.77}$$

where $\alpha(L)$ and $\delta(L)$ are polynomials in the lag operator $L$, neither of which
includes a constant term. All of the parameters in the infinite-order autoregressive
representation $(1 - \delta(L))^{-1} \alpha(L)$ must be nonnegative. Otherwise, as in the
case of an ARCH($q$) model with one or more of the $\alpha_i < 0$, we could have
negative conditional variances.

There is a strong resemblance between the GARCH($p$, $q$) process (13.77) and
the ARMA($p$, $q$) process (13.21). In fact, if we let $\delta(L) = \rho(L)$, $\alpha_0 = \gamma$,
$\sigma_t^2 = y_t$, and $u_t^2 = \varepsilon_t$, we see that the former becomes formally the same as
an ARMA($p$, $q$) process in which the coefficient of $\varepsilon_t$ equals 0. However, the
formal similarity between the two processes masks some important differences.
In a GARCH process, the $\sigma_t^2$ are not observable, and $E(u_t^2) = \sigma^2 \neq 0$.


The simplest and by far the most popular GARCH model is the GARCH(1,1) 
process, for which the conditional variance can be written as 


$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1 \sigma_{t-1}^2. \tag{13.78}$$

Under the hypothesis of covariance stationarity, the unconditional variance
$\sigma^2$ can be found by taking the unconditional expectation of equation (13.78).
We find that

$$\sigma^2 = \alpha_0 + \alpha_1 \sigma^2 + \delta_1 \sigma^2.$$

Solving this equation yields the result that

$$\sigma^2 = \frac{\alpha_0}{1 - \alpha_1 - \delta_1}. \tag{13.79}$$

For this unconditional variance to exist, it must be the case that $\alpha_1 + \delta_1 < 1$,
and for it to be positive, we require that $\alpha_0 > 0$.


The GARCH(1,1) process generally seems to work quite well in practice. In
many cases, it cannot be rejected against any more general GARCH($p$, $q$)
process. An interesting empirical regularity is that the estimate $\hat\alpha_1$ is often
small and positive, with the estimate $\hat\delta_1$ much larger, and the sum of the
coefficients, $\hat\alpha_1 + \hat\delta_1$, between 0.9 and 1. These parameter values imply that
the time-varying volatility is highly persistent.
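
To give a sense of what such parameter values mean, the Python sketch below computes the unconditional variance (13.79) and a "half-life" of volatility shocks. The half-life is an illustrative summary added here, not something defined in the text. It relies on the fact that, for a GARCH(1,1) process, the expected deviation of the conditional variance from $\sigma^2$ shrinks by the factor $\alpha_1 + \delta_1$ each period. The parameter values and function names are hypothetical.

```python
import math


def garch11_unconditional_variance(alpha0, alpha1, delta1):
    """Unconditional variance alpha0 / (1 - alpha1 - delta1), as in (13.79)."""
    if alpha0 <= 0.0:
        raise ValueError("alpha0 must be positive")
    if alpha1 + delta1 >= 1.0:
        raise ValueError("alpha1 + delta1 must be less than 1")
    return alpha0 / (1.0 - alpha1 - delta1)


def volatility_half_life(alpha1, delta1):
    """Periods needed for an expected deviation of the conditional variance
    from its unconditional level to shrink by half."""
    return math.log(0.5) / math.log(alpha1 + delta1)


# Hypothetical estimates with a small alpha1 and a large delta1.
alpha0, alpha1, delta1 = 0.1, 0.05, 0.90
print(garch11_unconditional_variance(alpha0, alpha1, delta1))
print(volatility_half_life(alpha1, delta1))
```

With $\alpha_1 + \delta_1 = 0.95$, the half-life is about 13.5 periods. As the sum approaches 1, the half-life grows without bound.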


Testing for ARCH Errors


It is easy to test a regression model for the presence of ARCH or GARCH
errors. Imagine, for the moment, that we actually observe the $u_t$. Then we
can replace $\sigma_t^2$ by $u_t^2 - e_t$, where $e_t$ is defined to be the difference between $u_t^2$
and its conditional expectation. This allows us to rewrite the GARCH($p$, $q$)
model (13.76) as

$$u_t^2 = \alpha_0 + \sum_{i=1}^{\max(p,q)} (\alpha_i + \delta_i) u_{t-i}^2 + e_t - \sum_{j=1}^{p} \delta_j e_{t-j}. \tag{13.80}$$

In this equation, we have replaced all of the $\sigma_{t-j}^2$ by $u_{t-j}^2 - e_{t-j}$ and then
grouped the two summations that involve the $u_{t-i}^2$. Of course, if $p \neq q$, either
some of the $\alpha_i$ or some of the $\delta_i$ in the first summation are identically zero.
Equation (13.80) can now be interpreted as a regression model with dependent
variable $u_t^2$ and MA($p$) errors. If one were actually to estimate (13.80), the
MA structure would yield estimates of the $\delta_j$, and the estimated coefficients
of the $u_{t-i}^2$ would then allow the $\alpha_i$ to be estimated.


Rather than estimating (13.80), it is easier to base a test on the Gauss-Newton
regression that corresponds to (13.80), evaluated under the null hypothesis
that $\alpha_i = 0$ for $i = 1, \ldots, q$ and $\delta_j = 0$ for $j = 1, \ldots, p$. Since equation (13.80)
is linear with respect to the $\alpha_i$ and the $\delta_j$, the GNR is easy to derive. It is

$$u_t^2 - \alpha_0 = b_0 + \sum_{i=1}^{\max(p,q)} b_i u_{t-i}^2 + \text{residual}. \tag{13.81}$$

The artificial parameter $b_0$ here corresponds to the real parameter $\alpha_0$, and
the $b_i$, for $i = 1, \ldots, \max(p,q)$, correspond to the sums $\alpha_i + \delta_i$, because, under
the null, the $\alpha_i$ and $\delta_i$ are not separately identifiable. In the regressand, $\alpha_0$
would normally be the error variance estimated under the null. However, its
value is irrelevant if we are using equation (13.81) for testing, because there
is a constant term on the right-hand side.


Under the alternative, the GNR should, strictly speaking, incorporate the
MA structure of the error terms of (13.80). But, since these error terms are
white noise under the null, a valid test can be constructed without taking
account of the MA structure. The price to be paid for this simplification
is that the $\alpha_i$ and the $\delta_i$ remain unidentified as separate parameters, which
means that the test is the same for all GARCH($p$, $q$) alternatives with the
same value of $\max(p, q)$.


In practice, of course, we do not observe the $u_t$. But, as for the GNR-based
tests against other types of heteroskedasticity that we discussed in Section 7.5,
it is asymptotically valid to replace the unobserved $u_t$ by the least squares
residuals $\hat u_t$. Thus the test regression is actually

$$\hat u_t^2 = b_0 + \sum_{i=1}^{\max(p,q)} b_i \hat u_{t-i}^2 + \text{residual}, \tag{13.82}$$

where we have arbitrarily set $\alpha_0 = 0$. Because of the lags, this GNR would
normally be run over the last $n - \max(p, q)$ observations only. As usual, there
are several possible test statistics. The easiest to compute is probably $n$ times
the centered $R^2$, which is asymptotically distributed as $\chi^2(\max(p, q))$ under
the null. It is also asymptotically valid to use the standard F statistic for all
of the slope coefficients to be 0, treating it as if it followed the F distribution
with $\max(p,q)$ and $n - 2\max(p,q) - 1$ degrees of freedom. These tests can
easily be bootstrapped, and it is often wise to do so. We can use either a
parametric or a semiparametric bootstrap DGP.
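
The test based on regression (13.82) is simple enough to code directly. What follows is a minimal Python sketch, offered only as an illustration. It uses a small hand-written OLS routine so that nothing beyond the standard library is needed, it multiplies the centered $R^2$ by the number of observations actually used in the test regression (one possible reading of "$n$ times $R^2$"), and it treats simulated ARCH(1) draws as if they were least squares residuals. All function names and parameter values are hypothetical.

```python
import random


def ols_r_squared(X, y):
    """Centered R^2 from an OLS regression of y on X (X includes a constant)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yt for row, yt in zip(X, y))]
         for i in range(k)]
    for col in range(k):                      # Gauss-Jordan elimination
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    y_bar = sum(y) / len(y)
    ssr = sum((yt - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yt in zip(X, y))
    tss = sum((yt - y_bar) ** 2 for yt in y)
    return 1.0 - ssr / tss


def arch_test(residuals, lags):
    """Number of observations used times the centered R^2 from regressing
    squared residuals on a constant and `lags` of their own lagged values."""
    u2 = [u * u for u in residuals]
    y = u2[lags:]
    X = [[1.0] + [u2[t - i] for i in range(1, lags + 1)]
         for t in range(lags, len(u2))]
    return len(y) * ols_r_squared(X, y)


# Hypothetical "residuals": draws from an ARCH(1) process with alpha1 = 0.3.
rng = random.Random(42)
u, u_lag = [], 0.0
for _ in range(2000):
    u_lag = (1.0 + 0.3 * u_lag ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    u.append(u_lag)

statistic = arch_test(u, lags=1)
print(statistic > 3.84)   # 3.84 is the 5% critical value of chi-squared(1)
```

Here `lags` plays the role of $\max(p, q)$, and the statistic would be compared with a critical value from the $\chi^2(\max(p, q))$ distribution.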


Because it is very easy to compute a test statistic using regression (13.82), 
these tests are the most commonly used procedures to detect autoregressive 
conditional heteroskedasticity. However, other procedures may well perform 
better. In particular, Lee and King (1993) and Demos and Sentana (1998) 
have proposed various tests which take into account the fact that the alter- 
native hypothesis is one-sided. These one-sided tests have better power than 
tests based on the Gauss-Newton regression (13.82). 


The Stationary Distribution for ARCH and GARCH Processes 


In the case of an ARMA process, the stationary, or unconditional, distribution 
of the $u_t$ will be normal whenever the innovations $\varepsilon_t$ are normal white noise.
However, this is not true for (G)ARCH processes, because the mapping from
the $\varepsilon_t$ to the $u_t$ is nonlinear. As we will see, the stationary distribution is
not normal, and it may not even have a fourth moment. For simplicity, we 
will confine our attention to the fourth moment of the ARCH(1) process. 
Other moments of this process, and moments of the GARCH(1, 1) process, 
are treated in the exercises. 


For an ARCH(1) process with normal white noise innovations, or indeed any
such (G)ARCH process, the distribution of $u_t$ is normal conditional on $\Omega_{t-1}$.
Since the variance of this distribution is $\sigma_t^2$, the fourth moment is $3\sigma_t^4$, as we
saw in Exercise 4.2. For an ARCH(1) process, $\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2$. Therefore,

$$E(u_t^4 \mid \Omega_{t-1}) = 3(\alpha_0 + \alpha_1 u_{t-1}^2)^2 = 3\alpha_0^2 + 6\alpha_0 \alpha_1 u_{t-1}^2 + 3\alpha_1^2 u_{t-1}^4.$$

If we assume that the unconditional fourth moment exists and denote it by $m_4$,
we can take the unconditional expectation of this relation to obtain

$$m_4 = 3\alpha_0^2 + \frac{6\alpha_0^2 \alpha_1}{1 - \alpha_1} + 3\alpha_1^2 m_4,$$

where we have used the implication of equation (13.75) that the unconditional
second moment is $\alpha_0/(1 - \alpha_1)$. Solving this equation for $m_4$, we find that

$$m_4 = \frac{3\alpha_0^2 (1 + \alpha_1)}{(1 - \alpha_1)(1 - 3\alpha_1^2)}. \tag{13.83}$$

This result evidently cannot hold unless $3\alpha_1^2 < 1$. In fact, if this condition
fails, the fourth moment does not exist. From the result (13.83), we can
see that $m_4 > 3\sigma^4 = 3\alpha_0^2/(1 - \alpha_1)^2$ whenever $\alpha_1 > 0$. Thus, whatever the
stationary distribution of $u_t$ might be, it certainly cannot be normal. At the
time of writing there is, as far as the authors are aware, no explicit, analytical
characterization of the stationary distribution for (G)ARCH processes.
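
A few lines of Python make the implications of (13.83) concrete. The sketch below evaluates the fourth moment and the ratio $m_4/\sigma^4$, which would equal 3 under normality, for hypothetical parameter values. The function names are invented for this illustration.

```python
def arch1_fourth_moment(alpha0, alpha1):
    """Unconditional fourth moment (13.83) of an ARCH(1) process with
    normal innovations; it exists only if 3 * alpha1**2 < 1."""
    if not 0.0 <= alpha1 < 1.0:
        raise ValueError("need 0 <= alpha1 < 1 for a finite variance")
    if 3.0 * alpha1 ** 2 >= 1.0:
        raise ValueError("fourth moment does not exist: 3*alpha1^2 >= 1")
    return (3.0 * alpha0 ** 2 * (1.0 + alpha1)
            / ((1.0 - alpha1) * (1.0 - 3.0 * alpha1 ** 2)))


def arch1_kurtosis(alpha0, alpha1):
    """Ratio m4 / sigma^4, which equals 3 for a normal distribution."""
    sigma2 = alpha0 / (1.0 - alpha1)
    return arch1_fourth_moment(alpha0, alpha1) / sigma2 ** 2


print(arch1_fourth_moment(1.0, 0.5))   # 36.0
print(arch1_kurtosis(1.0, 0.5))        # 9.0
print(arch1_kurtosis(1.0, 0.0))        # 3.0
```

The ratio simplifies to $3(1 - \alpha_1^2)/(1 - 3\alpha_1^2)$, which exceeds 3 for any $\alpha_1 > 0$ and diverges as $3\alpha_1^2$ approaches 1.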


Estimating ARCH and GARCH Models 


Since (G)ARCH processes induce heteroskedasticity, it might seem natural 
to estimate a regression model with (G)ARCH errors by using feasible GLS. 
The first step would be to estimate the underlying regression model by OLS 
or NLS in order to obtain consistent but inefficient estimates of the regression 
parameters, along with least squares residuals $\hat u_t$. The second step would
be to estimate the parameters of the (G)ARCH process by treating the $\hat u_t^2$
as if they were actual squared error terms and estimating a model with a
specification something like (13.80), again by least squares. The final step
would be to estimate the original regression model by feasible weighted least
squares, using weights proportional to the inverse square roots of the fitted
values from the model for the $\hat u_t^2$.


This approach is very rarely used, because it is not asymptotically efficient. 
The skedastic function, which would, for example, be the right-hand side of 
equation (13.78) in the case of a GARCH(1, 1) model, depends on the lagged 
squared residuals, which in turn depend on the estimates of the regression 
function. Because of this, estimating both functions together yields more 
efficient estimates than estimating each of them conditional on estimates of 
the other; see Engle (1982). 


The most popular way to estimate models with GARCH errors is to assume 
that the error terms are normally distributed and use maximum likelihood. 
We can write a linear regression model with GARCH errors defined in terms
of a normal innovation process as

$$\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)} = \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1), \tag{13.84}$$

where $y_t$ is the dependent variable, $X_t$ is a vector of exogenous or predetermined
regressors, and $\beta$ is a vector of regression parameters. The skedastic
function $\sigma_t^2(\beta, \theta)$ is defined for some particular choice of $p$ and $q$ by
equation (13.76) with $u_t$ replaced by $y_t - X_t\beta$. It therefore depends on $\beta$ as well
as on the $\alpha_i$ and $\delta_j$ that appear in (13.76), which we denote collectively by $\theta$.
The density of $y_t$ conditional on $\Omega_{t-1}$ is then

$$\frac{1}{\sigma_t(\beta, \theta)}\,\phi\!\left(\frac{y_t - X_t\beta}{\sigma_t(\beta, \theta)}\right), \tag{13.85}$$

where $\phi(\cdot)$ denotes the standard normal density. The first factor in (13.85) is
a Jacobian factor which reflects the fact that the derivative of $\varepsilon_t$ with respect
to $y_t$ is $\sigma_t^{-1}(\beta, \theta)$; see Section 10.8.


By taking the logarithm of expression (13.85), we find that the contribution
to the loglikelihood function made by the $t^{\rm th}$ observation is

$$\ell_t(\beta, \theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log \sigma_t^2(\beta, \theta) - \frac{1}{2}\,\frac{(y_t - X_t\beta)^2}{\sigma_t^2(\beta, \theta)}. \tag{13.86}$$

Unfortunately, it is not entirely straightforward to evaluate this expression.
The problem is the skedastic function $\sigma_t^2(\beta, \theta)$, which is defined implicitly by
the recursion (13.77). This recursion does not constitute a complete definition
because it does not provide starting values to initialize the recursion. In
trying to find suitable starting values, we run into the difficulty, mentioned
in the previous subsection, that there exists no closed-form expression for the
stationary GARCH density.


If we are dealing with an ARCH(q) model, we can sidestep this problem by 
conditioning on the first q observations. Since, in this case, the skedastic 
function $\sigma_t^2(\beta, \theta)$ is determined completely by $q$ lags of the squared residuals,
there is no missing information for observations q + 1 through n. We can 
therefore sum the contributions (13.86) for just those observations, and then 
maximize the result. This leads to ML estimates conditional on the first q 
observations. But such a procedure works only for models with pure ARCH 
errors, and these models are very rarely used in practice. 


With a GARCH($p$, $q$) model, $p$ starting values of $\sigma_t^2$ are needed in addition
to $q$ starting values of the squared residuals in order to initialize the recursion
(13.77). It is therefore necessary to resort to some sort of ad hoc procedure
to specify the starting values. A not very good idea is just to set all unknown
pre-sample values of $\hat u_t^2$ and $\sigma_t^2$ to zero. A better idea is to replace them by an
estimate of their common unconditional expectation. At least two different
ways of doing this are in common use. The first is to replace the unconditional
expectation by the appropriate function of the $\theta$ parameters, which would be
given by the rightmost expression in equation (13.79) for GARCH(1, 1). The
second, which is easier, is just to use the sum of squared residuals from OLS
estimation, divided by $n$.
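
To show how the starting values enter, here is a minimal Python sketch of the GARCH(1,1) loglikelihood obtained by summing the contributions (13.86), with the residuals $y_t - X_t\beta$ taken as given. It is an illustration under stated assumptions rather than a recommended implementation. The function name and the `start` argument are hypothetical, and the two options correspond to the two choices just described: the sum of squared residuals divided by $n$, or the unconditional variance implied by the parameters.

```python
import math


def garch11_loglik(resid, alpha0, alpha1, delta1, start="sample"):
    """Sum of the loglikelihood contributions for a GARCH(1,1) model with
    normal innovations, given residuals u_t = y_t - X_t * beta.

    start="sample":  pre-sample u^2 and sigma^2 set to sum(u_t^2) / n
    start="implied": pre-sample values set to alpha0 / (1 - alpha1 - delta1)
    """
    n = len(resid)
    if start == "sample":
        init = sum(u * u for u in resid) / n
    elif start == "implied":
        init = alpha0 / (1.0 - alpha1 - delta1)
    else:
        raise ValueError("start must be 'sample' or 'implied'")

    u_sq_lag, sigma2_lag = init, init
    loglik = 0.0
    for u in resid:
        # Skedastic function: the GARCH(1,1) recursion for sigma_t^2.
        sigma2 = alpha0 + alpha1 * u_sq_lag + delta1 * sigma2_lag
        loglik += (-0.5 * math.log(2.0 * math.pi)
                   - 0.5 * math.log(sigma2)
                   - 0.5 * u * u / sigma2)
        u_sq_lag, sigma2_lag = u * u, sigma2
    return loglik


# Hypothetical residuals and parameter values.
resid = [0.5, -1.2, 0.3, 2.0]
print(garch11_loglik(resid, 0.1, 0.1, 0.8, start="sample"))
print(garch11_loglik(resid, 0.1, 0.1, 0.8, start="implied"))
```

Even in this tiny example, the two starting-value conventions give different values of the loglikelihood at the same parameters, which is why they can lead to different estimates.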


Another approach, similar to one we discussed for models with MA errors, 
is to treat the unknown starting values as extra parameters, and to maximize
the loglikelihood with respect to them, $\beta$, and $\theta$ jointly. In all but
huge samples, the choice of starting values can have a significant effect on the 
parameter estimates. Consequently, different programs for GARCH estima- 
tion can produce very different results. This unsatisfactory state of affairs, 
documented convincingly by Brooks, Burke, and Persand (2001), results from 
doing ML estimation conditional on different things. 


For any choice of starting values, maximizing a loglikelihood function obtained
by summing the contributions (13.86) is not particularly easy, especially in
the case of GARCH models. Numerical difficulties seem to be quite common.
It is vital to use analytical, rather than numerical, first derivatives, and for 
some algorithms it is highly desirable to use analytical second derivatives as 
well; these may be found in Fiorentini, Calzolari, and Panattoni (1996). Exer- 
cise 13.22 proposes an artificial regression, which makes use of first derivatives 
only. Not all software packages provide reliable estimates and standard errors; 
see McCullough and Renfro (1999) and Brooks, Burke, and Persand (2001). 
Therefore, we strongly recommend estimating this type of model more than 
once using different options and different computer programs. 


Although GARCH models have error terms with thicker tails than those of 
the normal distribution, data from financial markets often have tails even 
thicker than those implied by a GARCH model with normal $\varepsilon_t$. It is therefore
quite common to modify (13.84) by assuming that the $\varepsilon_t$ follow a distribution
with thicker tails than the standard normal. One possibility is the Student’s t 
distribution with a small number of degrees of freedom, which may be chosen 
in advance or estimated. Maximum likelihood estimation then proceeds in 
the usual way. 


We can use any of the estimators discussed in Section 10.4 to estimate the 
covariance matrix of the ML estimates. One of these, the information matrix 
estimator, can be computed by means of the artificial regression that is in- 
troduced in Exercise 13.22. If the error terms are not distributed according 
to the normal or whatever distribution we have assumed, the ML estimates 
are still consistent, but they are not asymptotically efficient. In this case, the 
sandwich covariance matrix estimator (10.45) is consistent, but covariance 
matrix estimators that rely on the information matrix equality generally are 
not. A variant of the sandwich estimator specifically adapted to GARCH 
models was derived by Bollerslev and Wooldridge (1992). These and other 
possible variants are discussed and compared by Fiorentini, Calzolari, and 
Panattoni (1996).⁵


Simulating ARCH and GARCH Models 


ARCH and GARCH models can be simulated recursively in much the same
way as ARMA models. The successive values of the $\sigma_t^2$ are computed on the
basis of past realizations of the $u_t^2$ and $\sigma_t^2$ series, and the $u_t$ are generated
as $\sigma_t\varepsilon_t$ for a white-noise series $\varepsilon_t$, which is often but not always normal.
However, the problem of finding suitable starting values for the recursion is
much harder for (G)ARCH models than for ARMA ones, because we cannot
simply draw them from the stationary distribution.


⁵ It is stated in this paper and elsewhere in the literature that the information
matrix is block diagonal with respect to $\beta$ and $\theta$. This is misleading, since it is
true only if the information matrix is defined using unconditional expectations.
If the contribution of observation $t$ to the information matrix is computed as
an expectation conditional on $\Omega_{t-1}$, as it should be for efficiency, then the
information matrix is not block diagonal.


The easiest approach is the one already mentioned in the ARMA context, 
whereby one starts the recursion for some large negative t and discards the 
elements of the simulated series for nonpositive $t$. It is natural to set the
initial values of $\sigma_t^2$ in this recursion to the unconditional expectation of the $u_t^2$
or, in the bootstrap case, to an estimate of this unconditional expectation.
However, this approach is not entirely satisfactory for bootstrapping, where
we wish to condition on the observed data as far as possible. One possibility
would be to condition on the first $\max(p,q)$ observations, using the first $q$
squared residuals as the initial values of $u_t^2$ and the first $p$ squared residuals
as the initial values of $\sigma_t^2$. However, since much work remains to be done
on bootstrapping (G)ARCH models, we cannot recommend this or any other 
approach at the present time. 
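The burn-in approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the GARCH(1,1) recursion $\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2$ with standard normal $\varepsilon_t$, starts $\sigma_t^2$ at the unconditional variance $\alpha_0/(1-\alpha_1-\delta_1)$, and discards an initial stretch of the series. The function name `simulate_garch11`, its arguments, and the default burn-in length of 500 are hypothetical choices, not something prescribed in the text.

```python
import numpy as np

def simulate_garch11(n, alpha0, alpha1, delta1, burn_in=500, seed=None):
    """Simulate n observations of u_t = sigma_t * eps_t, where
    sigma_t^2 = alpha0 + alpha1 * u_{t-1}^2 + delta1 * sigma_{t-1}^2
    and eps_t is standard normal (an assumption of this sketch)."""
    if alpha1 + delta1 >= 1.0:
        raise ValueError("alpha1 + delta1 must be less than 1")
    rng = np.random.default_rng(seed)
    total = n + burn_in                 # burn-in plays the role of "large negative t"
    eps = rng.standard_normal(total)
    sigma2 = np.empty(total)
    u = np.empty(total)
    # Start the recursion at the unconditional expectation of u_t^2.
    sigma2[0] = alpha0 / (1.0 - alpha1 - delta1)
    u[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, total):
        sigma2[t] = alpha0 + alpha1 * u[t - 1] ** 2 + delta1 * sigma2[t - 1]
        u[t] = np.sqrt(sigma2[t]) * eps[t]
    # Discard the burn-in observations.
    return u[burn_in:], sigma2[burn_in:]
```

For example, `u, sigma2 = simulate_garch11(1000, 0.1, 0.1, 0.8, seed=42)` returns a simulated error series together with its conditional variances. In a bootstrap setting, the parameter values would be replaced by estimates, and, as the text notes, this way of initializing the recursion does not condition on the observed data.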


Our discussion of autoregressive conditional heteroskedasticity has necessarily 
been quite superficial. There have been many extensions of the basic ARCH 
and GARCH models discussed here, among them the exponential GARCH 
model of Nelson (1991) and the absolute GARCH model of Hentschel (1995). 
These models are intended to explain empirical features of financial time series 
that the standard GARCH model cannot capture. More detailed treatments 
may be found in Bollerslev, Chou, and Kroner (1992), Bollerslev, Engle, and 
Nelson (1994), Hamilton (1994, Chapter 21), and Pagan (1996). 


13.7 Vector Autoregressions 


The dynamic models discussed in Section 13.4 were single-equation models. 
But we often want to model the dynamic relationships among several time- 
series variables. A simple way to do so without making many assumptions is 
to use what is called a vector autoregression, or VAR, model, which is the 
multivariate analog of an autoregressive model for a single time series. 


Let the $1\times g$ vector $Y_t$ denote the $t$th observation on a set of $g$ variables. Then
a vector autoregressive model of order $p$, sometimes referred to as a VAR($p$)
model, can be written as


$$Y_t = \alpha + \sum_{j=1}^{p} Y_{t-j}\Phi_j + U_t, \qquad U_t \sim \mathrm{IID}(0, \Sigma), \tag{13.87}$$


where $U_t$ is a $1\times g$ vector of error terms, $\alpha$ is a $1\times g$ vector of constant terms,
and the $\Phi_j$, for $j = 1,\ldots,p$, are $g\times g$ matrices of coefficients, all of which
are to be estimated. If $y_{ti}$ denotes the $i$th element of $Y_t$ and $\phi_{j,ki}$ denotes the
$ki$th element of $\Phi_j$, then the $i$th column of (13.87) can be written as


$$y_{ti} = \alpha_i + \sum_{j=1}^{p}\sum_{k=1}^{g} y_{t-j,k}\,\phi_{j,ki} + u_{ti}.$$


This is just a linear regression, in which $y_{ti}$ depends on a constant term and
lags 1 through $p$ of all of the $g$ variables in the system. Thus we see that the
VAR (13.87) has the form of a multivariate linear regression model, or SUR
model, like the ones we discussed in Section 12.2.


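To see this clearly, let us make the definitions


$$X_t \equiv [\,1 \;\; Y_{t-1} \;\; \cdots \;\; Y_{t-p}\,] \quad\text{and}\quad \Pi \equiv \begin{bmatrix} \alpha \\ \Phi_1 \\ \vdots \\ \Phi_p \end{bmatrix}.$$

The row vector $X_t$ has $k = gp + 1$ elements, and the matrix $\Pi$ is $k\times g$. With
these definitions, the VAR (13.87) becomes


$$Y_t = X_t\Pi + U_t, \qquad U_t \sim \mathrm{IID}(0, \Sigma), \tag{13.88}$$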


which has the form of a multivariate regression model. In fact, if we stack 
the rows, it has precisely the same form as (12.71), which is the unrestricted 
reduced form for a linear simultaneous equations model. Thus a VAR can be 
thought of as a set of reduced form linear equations relating the endogenous 
variables in the vector $Y_t$ to the predetermined variables that are collected in
the vector $X_t$. Except for the constant term, these predetermined variables
are the first p lags of all the endogenous variables themselves. 


Estimating a vector autoregression is very easy. As we saw in Section 12.2, 
it is appropriate to estimate a linear system like (13.88), in which the same 
regressors appear in every equation, by ordinary least squares. In such a 
case, OLS is both the efficient GLS estimator and the maximum likelihood 
estimator under the assumption of multivariate normal errors. If $\hat\Pi$ denotes
the matrix of OLS estimates, it follows from (12.41) that the maximized value
of the loglikelihood function is


$$-\frac{gn}{2}(\log 2\pi + 1) - \frac{n}{2}\log|\hat\Sigma|, \tag{13.89}$$


where


$$\hat\Sigma \equiv \frac{1}{n}(Y - X\hat\Pi)^\top(Y - X\hat\Pi) = \frac{1}{n}\sum_{t=1}^{n}\hat U_t^\top\hat U_t \tag{13.90}$$

is the ML estimate of the covariance matrix $\Sigma$. Here $Y$ is the $n\times g$ matrix
with typical row $Y_t$, $X$ is the $n\times k$ matrix with typical row $X_t$, and $\hat U_t$ is
the row vector of OLS residuals for observation $t$. The estimate (13.90) is
often of considerable interest, because it captures the covariances between the 
innovations in the various equations. 
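To make the estimation steps concrete, here is a minimal NumPy sketch of equations (13.88) to (13.90). It assumes the data are held in a $T\times g$ array whose first $p$ rows serve as pre-sample values, so that $n = T - p$; the function names `build_var_matrices` and `fit_var_ols` are hypothetical and are introduced only for this illustration.

```python
import numpy as np

def build_var_matrices(data, p):
    """Return Y (n x g) and X (n x (g*p + 1)) for a VAR(p), with n = T - p."""
    data = np.asarray(data, dtype=float)
    T = data.shape[0]
    Y = data[p:]
    # Block j holds Y_{t-j} for every usable t; the leading column of ones
    # carries the constant, so row t of X is [1, Y_{t-1}, ..., Y_{t-p}].
    lags = [data[p - j:T - j] for j in range(1, p + 1)]
    X = np.hstack([np.ones((T - p, 1))] + lags)
    return Y, X

def fit_var_ols(data, p):
    """OLS estimates of Pi, the ML estimate (13.90) of Sigma, and the
    maximized loglikelihood (13.89)."""
    Y, X = build_var_matrices(data, p)
    n, g = Y.shape
    # The same regressors appear in every equation, so one least-squares
    # call estimates all g columns of Pi at once.
    Pi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ Pi_hat
    Sigma_hat = resid.T @ resid / n
    _, logdet = np.linalg.slogdet(Sigma_hat)
    loglik = -0.5 * g * n * (np.log(2.0 * np.pi) + 1.0) - 0.5 * n * logdet
    return Pi_hat, Sigma_hat, loglik
```

The first row of `Pi_hat` then estimates $\alpha$, and each subsequent block of $g$ rows estimates one of the $\Phi_j$. Using `slogdet` rather than taking the logarithm of a determinant is a numerical precaution, not part of the method itself.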


When specifying a VAR, it is important to determine how many lags need to 
be included. If one wishes to test the null hypothesis that the longest lag in 
the system is $p$ against the alternative that it is $p+1$, the easiest way to do
so is to compute the LR statistic

$$n\bigl(\log|\hat\Sigma(p)| - \log|\hat\Sigma(p+1)|\bigr), \tag{13.91}$$


where $\hat\Sigma(p)$ and $\hat\Sigma(p+1)$ denote the ML estimates of $\Sigma$ for systems with $p$
and $p+1$ lags, respectively; both of these may be computed using (13.90).
This test statistic is asymptotically distributed as $\chi^2(g^2)$. However, unless
the sample size $n$ is large relative to the number of parameters in the system
($g + pg^2$ under the null, and $g + (p+1)g^2$ under the alternative), the finite-
sample distribution of the LR statistic (13.91) may differ substantially from
its asymptotic one. In consequence, this is a case in which it will often be 
very desirable to compute bootstrap rather than asymptotic P values. 
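As a rough sketch, the statistic (13.91) could be computed as follows. One assumption is made that the text does not spell out: both systems are estimated over the same $n = T - p - 1$ observations, so that the two values of $\hat\Sigma$ are comparable. The names `sigma_hat` and `lr_lag_statistic` are hypothetical.

```python
import numpy as np

def sigma_hat(data, p, start):
    """ML estimate (13.90) of Sigma for a VAR(p) fitted to rows start..T-1."""
    data = np.asarray(data, dtype=float)
    T = data.shape[0]
    Y = data[start:]
    X = np.hstack([np.ones((T - start, 1))] +
                  [data[start - j:T - j] for j in range(1, p + 1)])
    Pi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U = Y - X @ Pi_hat
    return U.T @ U / (T - start)

def lr_lag_statistic(data, p):
    """LR statistic (13.91) for p lags against p + 1, and its degrees of freedom."""
    data = np.asarray(data, dtype=float)
    T, g = data.shape
    start = p + 1          # assumption: common estimation sample for both systems
    n = T - start
    logdet_p = np.linalg.slogdet(sigma_hat(data, p, start))[1]
    logdet_p1 = np.linalg.slogdet(sigma_hat(data, p + 1, start))[1]
    return n * (logdet_p - logdet_p1), g * g
```

An asymptotic P value can then be obtained from the $\chi^2(g^2)$ distribution, for instance with `scipy.stats.chi2.sf(stat, df)` if SciPy is available, although the text cautions that such P values may be unreliable in samples of moderate size.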


Since there is more than one way to generate bootstrap samples for a VAR, it 
is worth saying a bit more about this. We suggest using (13.87) to generate 
the data recursively, with OLS estimates under the null replacing the unknown 
parameters. The bootstrap error terms are obtained by resampling the row 
vectors $\tilde U_t$, where $\tilde U_t$ is equal to $(n/(n - 1 - gp))^{1/2}$ times the row vector
$\hat U_t$ of OLS residuals, and actual pre-sample values of $Y_t$ are used to start
the recursive process of generating the bootstrap data. Limited simulation
evidence suggests that this procedure yields much more accurate P values for
tests based on (13.91) than using the $\chi^2(g^2)$ distribution.
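The bootstrap procedure just described might be implemented along the following lines. This is a sketch under stated assumptions rather than a definitive implementation: both models are estimated on the common sample that starts at observation $p+1$, the first $p+1$ rows of the data act as the actual pre-sample values, and the bootstrap P value is taken to be the proportion of bootstrap statistics at least as large as the actual one. All function names, and the default of $B = 199$ bootstrap samples, are hypothetical.

```python
import numpy as np

def ols_var(data, p, start):
    """OLS coefficients and residuals of a VAR(p) fitted to rows start..T-1."""
    T = data.shape[0]
    Y = data[start:]
    X = np.hstack([np.ones((T - start, 1))] +
                  [data[start - j:T - j] for j in range(1, p + 1)])
    Pi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Pi_hat, Y - X @ Pi_hat

def lr_stat(data, p, start):
    """LR statistic (13.91) computed on a common estimation sample."""
    n = data.shape[0] - start
    _, U0 = ols_var(data, p, start)
    _, U1 = ols_var(data, p + 1, start)
    logdet0 = np.linalg.slogdet(U0.T @ U0 / n)[1]
    logdet1 = np.linalg.slogdet(U1.T @ U1 / n)[1]
    return n * (logdet0 - logdet1)

def bootstrap_lag_test(data, p, B=199, seed=None):
    """Bootstrap P value for the null of p lags against p + 1."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    T, g = data.shape
    start = p + 1
    n = T - start
    stat = lr_stat(data, p, start)
    Pi0, U0 = ols_var(data, p, start)             # estimates under the null
    U_tilde = np.sqrt(n / (n - 1 - g * p)) * U0   # rescaled residuals
    exceed = 0
    for _ in range(B):
        sim = data.copy()                         # rows before start stay actual
        draws = U_tilde[rng.integers(0, n, size=n)]   # resample whole rows
        for t in range(start, T):
            x = np.concatenate([[1.0]] + [sim[t - j] for j in range(1, p + 1)])
            sim[t] = x @ Pi0 + draws[t - start]   # recursion (13.87)
        if lr_stat(sim, p, start) >= stat:
            exceed += 1
    return stat, exceed / B
```

Resampling entire rows of residuals, rather than each column separately, preserves the contemporaneous correlation of the innovations across equations.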


If we wish to construct confidence intervals for, or test hypotheses about, 
individual parameters in a VAR, we can use the OLS standard errors, which 
are asymptotically valid. Similarly, if we wish to test hypotheses concerning 
two or more parameters in a single equation, we can compute Wald tests in the 
usual way based on the OLS covariance matrix for that equation. However, if 
we wish to test hypotheses concerning coefficients in two or more equations, we 
need the covariance matrix of the parameter estimates for the entire system. 


We saw in Chapter 12 that the estimated covariance matrix for the feasi- 
ble GLS estimates of a multivariate regression model is given by expression 
(12.19), and the one for the ML estimates is given by expression (12.38). 
These two covariance matrices differ only because they use different estimates 
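of $\Sigma$. As in Section 12.2, we let $X_\bullet \equiv I_g \otimes X$, which is a $gn\times gk$ matrix. Then,
if all the parameters are stacked into a vector of length $gk$, both covariance
matrices have the form


$$\bigl(X_\bullet^\top(\hat\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1}.$$


Using the rules for manipulating Kronecker products given in equations
(12.08), we see that


$$\bigl(X_\bullet^\top(\hat\Sigma^{-1}\otimes I_n)X_\bullet\bigr)^{-1} = \bigl((I_g\otimes X^\top)(\hat\Sigma^{-1}\otimes I_n)(I_g\otimes X)\bigr)^{-1} = \hat\Sigma\otimes(X^\top X)^{-1}.$$


Thus the covariance matrix for all the coefficients of a VAR is easily computed
from $\hat\Sigma$, which is given in (13.90), and the inverse of the $X^\top X$ matrix.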
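The Kronecker product identity above is easy to check numerically. The sketch below computes the covariance matrix both ways; the function names are hypothetical, and the brute-force version is included only for verification, since it forms a $gn\times gn$ matrix and is wasteful for samples of realistic size.

```python
import numpy as np

def var_coef_covariance(X, Sigma_hat):
    """Sigma_hat kron (X'X)^{-1}: covariance matrix of the coefficients,
    stacked equation by equation (column by column of Pi)."""
    return np.kron(Sigma_hat, np.linalg.inv(X.T @ X))

def var_coef_covariance_bruteforce(X, Sigma_hat):
    """The same matrix, computed directly as (X_dot' (Sigma^{-1} kron I_n) X_dot)^{-1}."""
    n = X.shape[0]
    g = Sigma_hat.shape[0]
    X_dot = np.kron(np.eye(g), X)                           # gn x gk
    weight = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))   # gn x gn
    return np.linalg.inv(X_dot.T @ weight @ X_dot)
```

For any full-rank `X` and positive definite `Sigma_hat`, the two functions should agree up to rounding error.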
The idea of using vector autoregressions instead of structural models to model
macroeconomic dynamics is often attributed to Sims (1980). Our treatment
has been very brief. For a more detailed introductory treatment, with many
references, see Lütkepohl (2001). For a review of macroeconomic applications
of VARs, see Stock and Watson (2001).


Granger Causality 


One common use of vector autoregressions is to test the hypothesis that one 
or more of the variables in a VAR do not “Granger cause” the others. The 
concept of Granger causality was developed by Granger (1969). Other, closely 
related, definitions of causality have been suggested, notably by Sims (1972). 
Suppose we divide the variables in a VAR into two groups, $Y_{t1}$ and $Y_{t2}$, which
are row vectors of dimensions $g_1$ and $g_2$, respectively. Then we may say that
$Y_{t2}$ does not Granger cause $Y_{t1}$ if the distribution of $Y_{t1}$, conditional on past
values of both $Y_{t1}$ and $Y_{t2}$, is the same as the distribution of $Y_{t1}$ conditional
only on its own past values.


In practice, it would be very difficult to test whether the entire distribution 
of $Y_{t1}$ depends on past values of $Y_{t2}$. Therefore, we almost always content
ourselves with asking whether the conditional mean of $Y_{t1}$ depends on past
values of $Y_{t2}$. In terms of the VAR (13.87), this is equivalent to imposing
restrictions on the equations that correspond to $Y_{t1}$. We can rewrite the
VAR as


$$[\,Y_{t1} \;\; Y_{t2}\,] = [\,\alpha_1 \;\; \alpha_2\,] + \sum_{j=1}^{p}[\,Y_{t-j,1} \;\; Y_{t-j,2}\,]\begin{bmatrix}\Phi_{j,11} & \Phi_{j,12}\\ \Phi_{j,21} & \Phi_{j,22}\end{bmatrix} + [\,U_{t1} \;\; U_{t2}\,],$$


where the matrices $\Phi_j$ have been partitioned to conform with the partition of
$Y_t$ and its lags. If $Y_{t2}$ does not Granger cause $Y_{t1}$, then all of the $\Phi_{j,21}$ must
be zero matrices. Similarly, if $Y_{t1}$ does not Granger cause $Y_{t2}$, then all of the
$\Phi_{j,12}$ must be zero matrices.


Copyright © 1999, Russell Davidson and James G. MacKinnon


Since the $\Phi_{j,21}$ appear only in the equations for $Y_{t1}$, it is easy to test the
hypothesis that they are all zero. We obtain ML estimates of the two systems
of equations


$$Y_{t1} = \alpha_1 + \sum_{j=1}^{p} Y_{t-j,1}\Phi_{j,11} + U_{t1}, \quad\text{and} \tag{13.92}$$

$$Y_{t1} = \alpha_1 + \sum_{j=1}^{p}\bigl(Y_{t-j,1}\Phi_{j,11} + Y_{t-j,2}\Phi_{j,21}\bigr) + U_{t1}, \tag{13.93}$$


which may be done using OLS for each equation, and then calculate the
value of the loglikelihood function for each of the systems. As in (13.89), the
loglikelihood depends only on the estimate of $\Sigma_{11}$, the $g_1\times g_1$ upper left-hand
block of $\Sigma$. This may easily be calculated using the OLS residuals, as in
(13.90). We obtain the LR statistic


$$n\bigl(\log|\tilde\Sigma_{11}| - \log|\hat\Sigma_{11}|\bigr), \tag{13.94}$$


where $\tilde\Sigma_{11}$ denotes the estimate of $\Sigma_{11}$ based on the OLS residuals from equa-
tions (13.92), and $\hat\Sigma_{11}$ denotes the estimate of $\Sigma_{11}$ based on the OLS residuals
from equations (13.93). The statistic (13.94) is asymptotically distributed as
$\chi^2(pg_1g_2)$, but more reliable inferences in finite samples can almost certainly
be obtained by bootstrapping. 
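A minimal sketch of the statistic (13.94) follows. It assumes that the two groups of variables are stored as arrays `Y1` ($T\times g_1$) and `Y2` ($T\times g_2$) and that the first $p$ rows are used as pre-sample values; the function names are hypothetical.

```python
import numpy as np

def lag_matrix(block, p):
    """Lags 1..p of every column of block (T x m), aligned with rows p..T-1."""
    T = block.shape[0]
    return np.hstack([block[p - j:T - j] for j in range(1, p + 1)])

def logdet_resid_cov(Y, X):
    """Log determinant of the residual covariance matrix from OLS of Y on X."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U = Y - X @ B
    return np.linalg.slogdet(U.T @ U / Y.shape[0])[1]

def granger_lr(Y1, Y2, p):
    """LR statistic (13.94) for 'Y2 does not Granger cause Y1', and its
    asymptotic degrees of freedom p * g1 * g2."""
    Y1 = np.asarray(Y1, dtype=float)
    Y2 = np.asarray(Y2, dtype=float)
    T, g1 = Y1.shape
    g2 = Y2.shape[1]
    n = T - p
    X_r = np.hstack([np.ones((n, 1)), lag_matrix(Y1, p)])   # system (13.92)
    X_u = np.hstack([X_r, lag_matrix(Y2, p)])                # system (13.93)
    stat = n * (logdet_resid_cov(Y1[p:], X_r) - logdet_resid_cov(Y1[p:], X_u))
    return stat, p * g1 * g2
```

Because the restricted regressors are a subset of the unrestricted ones, the statistic is nonnegative apart from rounding error.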


In practice, we are very commonly interested in testing Granger causality 
for a single dependent variable. In that case, equations (13.92) and (13.93) 
are univariate regressions. The restricted model, equation (13.92), becomes a 
regression of $y_t$ on a constant and $p$ of its own lagged values. The unrestricted
model, equation (13.93), adds $p$ lagged values of $g_2$ additional variables to this
regression. We can then perform an asymptotic $F$ test of the hypothesis that
the $pg_2$ coefficients of the lags of all the additional variables are jointly equal
to zero. For this test to be asymptotically valid, the error terms must be 
homoskedastic. If this assumption does not seem to be correct, we should 
instead perform a heteroskedasticity-robust test, as discussed in Section 6.8. 
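For the single-equation case, the $F$ statistic can be computed from the restricted and unrestricted sums of squared residuals. The sketch below assumes homoskedastic errors, as the text requires for this form of the test, and it does not implement the heteroskedasticity-robust alternative. The function names are hypothetical.

```python
import numpy as np

def lagged(block, p):
    """Lags 1..p of every column of block (T x m), aligned with rows p..T-1."""
    T = block.shape[0]
    return np.hstack([block[p - j:T - j] for j in range(1, p + 1)])

def ssr(y, X):
    """Sum of squared residuals from OLS of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

def granger_f_test(y, others, p):
    """F statistic for the null that p lags of `others` do not help to
    predict y, with its numerator and denominator degrees of freedom."""
    y = np.asarray(y, dtype=float)
    others = np.asarray(others, dtype=float)
    if others.ndim == 1:
        others = others[:, None]
    n = y.shape[0] - p
    X_r = np.hstack([np.ones((n, 1)), lagged(y[:, None], p)])  # restricted
    X_u = np.hstack([X_r, lagged(others, p)])                   # unrestricted
    r = p * others.shape[1]            # number of restrictions, p * g2
    df2 = n - X_u.shape[1]
    ssr_r, ssr_u = ssr(y[p:], X_r), ssr(y[p:], X_u)
    return ((ssr_r - ssr_u) / r) / (ssr_u / df2), r, df2
```

Since the test is only asymptotically valid, $r$ times the statistic may equally be compared with the $\chi^2(r)$ distribution.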


Our discussion of Granger causality has been quite brief. Hamilton (1994, 
Chapter 11) provides a much more detailed discussion of this topic. That 
book also discusses a number of other aspects of VAR models in more detail 
than we have done here. 


13.8 Final Remarks 


The analysis of time-series data has engaged the interest of a great many 
statisticians and econometricians and generated a massive literature. This 
chapter has provided only a superficial introduction to the subject. In partic- 
ular, we have said nothing at all about frequency domain methods, because 
they are a bit too specialized for this book. See Brockwell and Davis (1991), 
Box, Jenkins, and Reinsel (1994, Chapter 2), Hamilton (1994, Chapter 6), 
and Fuller (1995), among many others. 


This chapter has dealt only with stationary time series. A great many econ- 
omic time series are, or at least appear to be, nonstationary. Therefore, in the 
next chapter, we turn our attention to methods for dealing with nonstation- 
ary time series. Such methods have been a subject of an enormous amount of 
research in econometrics during the past two decades. 
13.9 Exercises


13.1 Show that the solution to the Yule-Walker equations (13.07) for the AR(2)
process is given by equations (13.08).

13.2 Demonstrate that the first $p+1$ Yule-Walker equations for the AR($p$) process
$u_t = \sum_{i=1}^{p}\rho_i u_{t-i} + \varepsilon_t$ are


$$v_0 - \sum_{i=1}^{p}\rho_i v_i = \sigma_\varepsilon^2, \quad\text{and}$$

$$\rho_i v_0 - v_i + \sum_{j=1,\,j\neq i}^{p}\rho_j v_{|i-j|} = 0, \quad i = 1,\ldots,p. \tag{13.95}$$


Then rewrite these equations using matrix notation.

13.3 Consider the AR(2) process


$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t,$$


for which the covariance matrix (13.09) of three consecutive observations has
elements specified by equations (13.08). Show that necessary conditions for
stationarity are that $\rho_1$ and $\rho_2$ lie inside the stationarity triangle which is
shown in Figure 13.1 and defined by the inequalities


$$\rho_1 + \rho_2 < 1, \quad \rho_2 - \rho_1 < 1, \quad\text{and}\quad \rho_2 > -1.$$


This can be done by showing that, outside the stationarity triangle, the matrix
(13.09) is not positive definite.

13.4 Show that, along the edges $\rho_1 + \rho_2 = 1$ and $\rho_1 - \rho_2 = -1$ of the AR(2)
stationarity triangle, both roots of the polynomial $1 - \rho_1 z - \rho_2 z^2$ are real,
one of them equal to 1 and the other greater than 1 in absolute value. Show
further that, along the edge $\rho_2 = -1$, both roots are complex and equal to 1
in absolute value. How are these facts related to the general condition for the
stationarity of an AR process?

13.5 Let $A(z)$ and $B(z)$ be two formal infinite power series in $z$, as follows:


$$A(z) = \sum_{i=0}^{\infty} a_i z^i \quad\text{and}\quad B(z) = \sum_{j=0}^{\infty} b_j z^j.$$

Let the formal product $A(z)B(z)$ be expressed similarly as the infinite series

$$C(z) = \sum_{k=0}^{\infty} c_k z^k.$$


Show that the coefficients $c_k$ are given by the convolution of the coefficients
$a_i$ and $b_j$, according to the formula


$$c_k = \sum_{i=0}^{k} a_i b_{k-i}, \quad k = 0, 1, \ldots.$$
13.6 Show that the method illustrated in Section 13.2 for obtaining the auto-
covariances of an ARMA(1,1) process can be extended to the ARMA($p$, $q$)
case. Since explicit formulas are hard to obtain for general $p$ and $q$, it is
enough to indicate a recursive method for obtaining the solution.

13.7 Plot the autocorrelation function for the ARMA(2,1) process

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t + \alpha_1\varepsilon_{t-1}$$


for lags $j = 0, 1, \ldots, 20$ and for parameter values $\rho_1 = 0.8$, $\rho_2 = -0.6$, and
$\alpha_1 = 0.5$. Repeat the exercise with $\rho_2 = 0$, the other two parameters being
unchanged, in order to see how the moving average component affects the
ACF in this case.

13.8 Consider the $p$ Yule-Walker equations (13.95) for an AR($p$) process as a set
of simultaneous linear equations for the $\rho_i$, $i = 1,\ldots,p$, given the auto-
covariances $v_i$, $i = 0, 1, \ldots, p$. Show that the $\rho_i$ which solve these equations
for given $v_i$ are also the solutions to the first-order conditions for the prob-
lem (13.30) used to define the partial autocorrelation coefficients for a process
characterized by the autocovariances $v_i$. Use this result to explain why the
$p$th partial autocorrelation coefficient for a given stationary process depends
only on the first $p$ (ordinary) autocorrelation coefficients.

13.9 Show that $\varepsilon_2$, as given by expression (13.39), has variance $\sigma_\varepsilon^2$ and is indepen-
dent of both $\varepsilon_1$ as given by (13.37) and the $\varepsilon_t$ for $t > 2$.

13.10 Define the $n\times n$ matrix $\Psi$ so that $\Psi^\top u = \varepsilon$, where the elements of the
$n$-vector $\varepsilon$ are defined by equations (13.35), (13.37), and (13.39). Show that
$\Psi$ is upper triangular, and write down the matrix $\Psi\Psi^\top$. Explain how $\Psi\Psi^\top$ is
related to the inverse of the covariance matrix (13.33), where the autocovari-
ances $v_i$ are those of the AR(2) process $u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t$.

13.11 Show that the second equation in (13.46) has real solutions for $\alpha_1$ in terms
of $\rho$ only if $|\rho| \le 0.5$. Explain why this makes sense. Show that if $\rho = \pm 0.5$,
then $\alpha_1 = \pm 1$. Show finally that, if $|\rho| < 0.5$, exactly one of the solutions
for $\alpha_1$ satisfies the invertibility condition that $|\alpha_1| < 1$.

13.12 The ARMA(1,1) process


$$u_t = \rho_1 u_{t-1} + \varepsilon_t + \alpha_1\varepsilon_{t-1}, \qquad \varepsilon_t \sim \mathrm{NID}(0, \sigma_\varepsilon^2),$$


can be simulated recursively if we have starting values for $u_1$ and $\varepsilon_1$, which
in turn can be generated from the joint stationary distribution of these two
random variables. Characterize this joint distribution.

13.13 Rewrite the ARMA($p$, $q$) model (13.31) in the form of the ARMAX($p$, $q$) model
(13.32) with $X_t\beta = \beta$. Show precisely how $\beta$ is related to $\gamma$.

13.14 Consider the MAX(1) model

$$y_t = X_t\beta + \varepsilon_t - \alpha\varepsilon_{t-1}.$$


Show how to estimate the parameters of this model by indirect inference using
as auxiliary model the nonlinear regression corresponding to AR(1) errors,


$$y_t = X_t\gamma + \rho y_{t-1} - \rho X_{t-1}\gamma + u_t.$$

In particular, show that, for true parameter values $\beta$ and $\alpha$, the pseudo-true
values are $\gamma = \beta$ and $\rho = -\alpha/(1 + \alpha^2)$.
13.15 This question uses the data in the file intrates-m.data, which contains four
monthly interest rate series for the United States from 1955 to 2001. Take
the first difference of two of these series, the federal funds rate, $r_t^s$, and the
10-year treasury bond rate, $r_t^l$. Then graph both the empirical ACF and the
empirical PACF of each of the differenced series for $J = 24$ for the period from
1957:1 to 2001:12. Does it seem likely that an AR(1) process would provide
a good model for either of these series? What about an MA(1) process?

13.16 For the two series $r_t^s$ and $r_t^l$ used in the previous exercise, estimate AR(1),
AR(2), MA(1), ARMA(1,1), ARMA(2,1), and ARMA(2,2) models with con-
stant terms by maximum likelihood and record the values of the loglikelihood
functions. In each case, which is the most parsimonious model that seems to
be compatible with the data?

13.17 The file hstarts.data contains the housing starts data graphed in Figure 13.2.
For the period 1966:1 to 2001:4, regress the unadjusted series $h_t$ on a constant,
$h_{t-1}$, the three seasonal dummies defined in (2.50), those dummies interacted
with the elements of the trend vector $t^q$ defined in (13.68), and those dummies
interacted with the squares of the elements of $t^q$. Then test the null hypothesis
that the error terms for this regression are serially independent against the
alternative that they follow the simple AR(4) process (13.63).

For the period 1966:1 to 1999:4, regress the adjusted series $h_t^*$ on the unad-
justed series $h_t$, a constant, and the nine seasonal dummy variables used in
the previous regression.

For the period 1966:1 to 1999:4, run the regression


$$h_t^* = \beta_0 + \sum_{j=-8}^{8}\delta_j h_{t+j} + u_t.$$


Compare the performance of this regression with that of the dummy variable
regression you just estimated. Which of them provides a better approximation
to the way in which the seasonally adjusted data were actually generated?

13.18 Consider the GARCH(1,1) model with conditional variance given by equa-
tion (13.78). Calculate the unconditional fourth moment of the stationary
distribution of the series $u_t$ generated as $u_t = \sigma_t\varepsilon_t$ with $\varepsilon_t \sim \mathrm{NID}(0,1)$. It
may be advisable to begin by calculating the unconditional fourth moment
of the stationary distribution of $\sigma_t$. What is the necessary condition for the
existence of these fourth moments? Show that, when the parameter $\delta_1$ is zero,
this condition becomes $3\alpha_1^2 < 1$, as for an ARCH(1) process.

13.19 This exercise is an extension of Exercise 4.2. By considering the derivative
of the function $z^{2r+1}\phi(z)$, where $\phi(\cdot)$ is the standard normal density, and
using an inductive argument, show that the $(2r)$th moment of the N(0,1)
distribution is equal to $\prod_{j=1}^{r}(2j - 1)$.

13.20 Use the result of the previous exercise to show that a necessary condition for
the existence of the $2r$th moment of the ARCH(1) process


$$u_t = \sigma_t\varepsilon_t; \qquad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2; \qquad \varepsilon_t \sim \mathrm{NID}(0, 1)$$


is that $\alpha_1^r\prod_{j=1}^{r}(2j - 1) < 1$.
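13.21 Consider the regression model $y = X\beta + u$, where $X$ is an $n\times k$ matrix,
in which the errors follow a GARCH(1,1) process with conditional variance
(13.78). Show that the skedastic function $\sigma_t^2(\beta,\theta)$ used in the loglikelihood
contribution $\ell_t(\beta,\theta)$ given in (13.86) can be written explicitly as

$$\sigma_t^2(\beta,\theta) = \frac{\alpha_0(1 - \delta_1^t)}{1 - \delta_1} + \alpha_1\sum_{s=1}^{t-1}\delta_1^{s-1}u_{t-s}^2 + \delta_1^{t-1}(\alpha_1 + \delta_1)\hat\sigma^2,$$

where $u_t$ stands for the residual $y_t - X_t\beta$, $\hat\sigma^2$ is defined as $n^{-1}\sum_{t=1}^{n}u_t^2$, and
all unavailable instances of both $u_t^2$ and $\sigma_t^2$ are replaced by $\hat\sigma^2$.


Then show that the first-order partial derivatives of $\ell_t(\beta,\theta)$ can be written
as follows:

$$\frac{\partial\ell_t}{\partial\beta} = \frac{\partial\ell_t}{\partial u_t}\frac{\partial u_t}{\partial\beta} + \frac{\partial\ell_t}{\partial\sigma_t^2}\frac{\partial\sigma_t^2}{\partial\beta} = \frac{X_t u_t}{\sigma_t^2} - \frac{u_t^2 - \sigma_t^2}{2\sigma_t^4}\Bigl(2\alpha_1\sum_{s=1}^{t-1}u_{t-s}X_{t-s}\delta_1^{s-1} + 2(\alpha_1 + \delta_1)\delta_1^{t-1}n^{-1}\sum_{t=1}^{n}X_t u_t\Bigr),$$

$$\frac{\partial\ell_t}{\partial\alpha_0} = \frac{\partial\ell_t}{\partial\sigma_t^2}\frac{\partial\sigma_t^2}{\partial\alpha_0} = \frac{(u_t^2 - \sigma_t^2)(1 - \delta_1^t)}{2\sigma_t^4(1 - \delta_1)},$$

$$\frac{\partial\ell_t}{\partial\alpha_1} = \frac{\partial\ell_t}{\partial\sigma_t^2}\frac{\partial\sigma_t^2}{\partial\alpha_1} = \frac{u_t^2 - \sigma_t^2}{2\sigma_t^4}\Bigl(\sum_{s=1}^{t-1}u_{t-s}^2\delta_1^{s-1} + \hat\sigma^2\delta_1^{t-1}\Bigr), \tag{13.96}$$

$$\frac{\partial\ell_t}{\partial\delta_1} = \frac{\partial\ell_t}{\partial\sigma_t^2}\frac{\partial\sigma_t^2}{\partial\delta_1} = \frac{u_t^2 - \sigma_t^2}{2\sigma_t^4}\Bigl(-\frac{t\alpha_0\delta_1^{t-1}}{1 - \delta_1} + \frac{\alpha_0(1 - \delta_1^t)}{(1 - \delta_1)^2} + \alpha_1\sum_{s=1}^{t-1}(s - 1)u_{t-s}^2\delta_1^{s-2} + \hat\sigma^2\bigl(t\delta_1^{t-1} + (t - 1)\alpha_1\delta_1^{t-2}\bigr)\Bigr).$$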
s=1 


13.22 Consider the following artificial regression in connection with the model with
GARCH(1,1) errors considered in the preceding exercise. Each real obser-
vation corresponds to two artificial observations. For observation $t$, the two
corresponding elements of the regressand are

$$u_t/\sigma_t \quad\text{and}\quad (u_t^2 - \sigma_t^2)/(\sigma_t^2\sqrt{2}).$$


The elements of the regressors corresponding to the elements of $\beta$ are the
elements of


$$\frac{X_t}{\sigma_t} \quad\text{and}\quad -\frac{\sqrt{2}}{\sigma_t^2}\Bigl(\alpha_1\sum_{s=1}^{t-1}u_{t-s}X_{t-s}\delta_1^{s-1} + (\alpha_1 + \delta_1)\delta_1^{t-1}n^{-1}\sum_{t=1}^{n}X_t u_t\Bigr).$$


For $\alpha_0$, the elements of the regressor are 0 and $(1 - \delta_1^t)/\bigl(\sigma_t^2\sqrt{2}(1 - \delta_1)\bigr)$. For
the regressor for $\alpha_1$, the elements are 0 and


$$\frac{1}{\sigma_t^2\sqrt{2}}\Bigl(\sum_{s=1}^{t-1}u_{t-s}^2\delta_1^{s-1} + \hat\sigma^2\delta_1^{t-1}\Bigr).$$

Finally, for $\delta_1$, the elements of the corresponding regressor are 0 and


$$\frac{1}{\sigma_t^2\sqrt{2}}\Bigl(-\frac{t\alpha_0\delta_1^{t-1}}{1 - \delta_1} + \frac{\alpha_0(1 - \delta_1^t)}{(1 - \delta_1)^2} + \alpha_1\sum_{s=1}^{t-1}(s - 1)u_{t-s}^2\delta_1^{s-2} + \hat\sigma^2\bigl(t\delta_1^{t-1} + (t - 1)\alpha_1\delta_1^{t-2}\bigr)\Bigr).$$


Show that, when the regressand is orthogonal to the regressors, the partial
derivatives (13.96) are zero. Let $R(\beta,\theta)$ denote the $2n\times(k+3)$ matrix of the
regressors, and let $\hat\beta$ and $\hat\theta$ be the ML estimates. Show that $R^\top(\hat\beta,\hat\theta)R(\hat\beta,\hat\theta)$
is the information matrix, where the contribution from observation $t$ is com-
puted as an expectation conditional on the information set $\Omega_t$.


13.23 This question uses data on monthly returns for the period 1969-1998 for
shares of General Electric Corporation from the file monthly-crsp.data. These
data are made available by courtesy of the Center for Research in Security
Prices (CRSP); see the comments at the bottom of the file. Let $R_t$ denote
the return on GE shares in month $t$. For the entire sample period, regress
$R_t$ on a constant and $d_t$, where $d_t$ is a dummy variable that is equal to 1
in November, December, January, and February, and equal to 0 in all other
months. Then test the hypothesis that the error terms are IID against the
alternative that they follow a GARCH(1,1) process.


13.24 Using the data from the previous question, estimate the GARCH(1,1) model

$$R_t = \beta_1 + \beta_2 d_t + u_t, \qquad \sigma_t^2 \equiv \mathrm{E}(u_t^2) = \alpha_0 + \alpha_1 u_{t-1}^2 + \delta_1\sigma_{t-1}^2. \tag{13.97}$$


Estimate this model by maximum likelihood, and perform an asymptotic Wald
test of the hypothesis that $\alpha_1 + \delta_1 = 1$. Then calculate the unconditional
variance $\sigma^2$ given by (13.79) and construct a .95 confidence interval for it.
Compare this with the estimate of the unconditional variance from the linear
regression model estimated in the previous question.


13.25 Using the ML estimates of the model (13.97) from the previous question, plot
both $\hat u_t^2$ and the estimated conditional variance $\hat\sigma_t^2$ against time. Put both
series on the same axes. Comment on the relationship between the two series.


13.26 Define the rescaled residuals from the model (13.97) as $\hat\varepsilon_t = \hat u_t/\hat\sigma_t$. Plot the
EDF of the rescaled residuals on the same axes as the CDF of the standard
normal distribution. Does there appear to be any evidence that the rescaled
residuals are not normally distributed?


13.27 The file intrates-q.data contains quarterly data for 1955 to 2001 on four US
interest rate series. Take first differences of these four series and, using data
for the period 1957:1 to 2001:4, estimate a vector autoregression with two
lags. Then estimate a VAR with three lags and test the hypothesis that $p$,
the maximum lag, is equal to 2 at the .05 level.


13.28 Using the same first-differenced data as in the previous question, and using
models with two lags, test the hypothesis that the federal funds rate does
not Granger cause the 10-year bond rate. Then test the hypothesis that the
10-year bond rate does not Granger cause the federal funds rate. Perform
both tests in two different ways, one of which assumes that the error variance
is constant and one of which allows for heteroskedasticity of unknown form.


Chapter 14


Unit Roots and 
Cointegration 


14.1 Introduction 


In this chapter, we turn our attention to models for a particular type of non- 
stationary time series. For present purposes, the usual definition of covariance 
stationarity is too strict. We consider instead an asymptotic version, which 
requires only that, as $t\to\infty$, the first and second moments tend to fixed
stationary values, and the covariances of the elements $y_t$ and $y_s$ tend to
stationary values that depend only on $|t - s|$. Such a series is said to be integrated
to order zero, or I(0), for a reason that will be clear in a moment.


A nonstationary time series is said to be integrated to order one, or I(1),¹
if the series of its first differences, $\Delta y_t \equiv y_t - y_{t-1}$, is I(0). More generally,
a series is integrated to order d, or I(d), if it must be differenced d times 
before an I(0) series results. A series is I(1) if it contains what is called a 
unit root, a concept that we will elucidate in the next section. As we will 
see there, using standard regression methods with variables that are I(1) can 
yield highly misleading results. It is therefore important to be able to test 
the hypothesis that a time series has a unit root. In Sections 14.3 and 14.4, 
we discuss a number of ways of doing so. Section 14.5 introduces the concept 
of cointegration, a phenomenon whereby two or more series with unit roots 
may be related, and discusses estimation in this context. Section 14.6 then 
discusses three ways of testing for the presence of cointegration. 


14.2 Random Walks and Unit Roots 


The asymptotic results we have developed so far depend on various regularity
conditions that are violated if nonstationary time series are included in the
set of variables in a model. In such cases, specialized econometric methods
must be employed that are strikingly different from those we have studied
so far. The fundamental building block for many of these methods is the
standardized random walk process, which is defined as follows in terms of a
unit-variance white-noise process $\varepsilon_t$:


$$w_t = w_{t-1} + \varepsilon_t, \qquad w_0 = 0, \qquad \varepsilon_t \sim \mathrm{IID}(0, 1). \tag{14.01}$$


¹ In the literature, such series are usually described as being integrated of order
one, but this usage strikes us as being needlessly ungrammatical.


Equation (14.01) is a recursion that can easily be solved to give


$$w_t = \sum_{s=1}^{t}\varepsilon_s. \tag{14.02}$$


It follows from (14.02) that the unconditional expectation $\mathrm{E}(w_t) = 0$ for all $t$.
In addition, $w_t$ satisfies the martingale property that $\mathrm{E}(w_t \mid \Omega_{t-1}) = w_{t-1}$ for
all $t$, where as usual the information set $\Omega_{t-1}$ contains all information that is
available at time $t - 1$, including in particular $w_{t-1}$. The martingale property
often makes economic sense, especially in the study of financial markets. We
use the notation $w_t$ here partly because “w” is the first letter of “walk” and
partly because a random walk is the discrete-time analog of a continuous-time 
stochastic process called a Wiener process, which plays a very important role 
in the asymptotic theory of nonstationary time series. 


The clearest way to see that $w_t$ is nonstationary is to compute $\mathrm{Var}(w_t)$. Since
$\varepsilon_t$ is white noise, we see directly that $\mathrm{Var}(w_t) = t$. Not only does this variance
depend on $t$, thus violating the stationarity condition, but, in addition, it
actually tends to infinity as $t\to\infty$, so that $w_t$ cannot be I(0).
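The result that $\mathrm{Var}(w_t) = t$ can be illustrated by simulation. The following sketch generates many independent paths of the standardized random walk (14.01) by cumulating the innovations, as in (14.02). The innovations are taken to be standard normal for convenience, which is stronger than the IID(0,1) assumption in the text, and the function name is hypothetical.

```python
import numpy as np

def simulate_standardized_random_walks(n_steps, n_paths, seed=None):
    """Return an (n_paths x n_steps) array whose column t-1 holds w_t,
    where w_t is the sum of eps_1, ..., eps_t and w_0 = 0."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_paths, n_steps))
    return np.cumsum(eps, axis=1)

if __name__ == "__main__":
    w = simulate_standardized_random_walks(100, 20000, seed=1)
    cross_section_var = w.var(axis=0)
    for t in (1, 25, 50, 100):
        # The variance across paths should be close to t.
        print(t, round(float(cross_section_var[t - 1]), 2))
```

The variance computed across paths at each date should grow roughly in proportion to $t$, whereas for a stationary series it would settle down to a constant.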


Although the standardized random walk process (14.01) is very simple, more 
realistic models are closely related to it. In practice, for example, an economic 
time series is unlikely to have variance 1. Thus the very simplest nonstationary 
time-series process for data that we might actually observe is the random walk 
process 


$$y_t = y_{t-1} + e_t, \qquad y_0 = 0, \qquad e_t \sim \mathrm{IID}(0, \sigma^2), \tag{14.03}$$


where $e_t$ is still white noise, but with arbitrary variance $\sigma^2$. This process,
which is often simply referred to as a random walk, can be based on the process
(14.01) using the equation $y_t = \sigma w_t$. If we wish to relax the assumption that
$y_0 = 0$, we can subtract $y_0$ from both sides of the equation so as to obtain the
relationship


$$y_t - y_0 = y_{t-1} - y_0 + e_t.$$

The equation $y_t = y_0 + \sigma w_t$ then relates $y_t$ to a series $w_t$ generated by the
standardized random walk process (14.01).


The next obvious generalization is to add a constant term. If we do so, we 
obtain the model 


$$y_t = \gamma_1 + y_{t-1} + e_t. \tag{14.04}$$

This model is often called a random walk with drift, and the constant term
is called a drift parameter. To understand this terminology, subtract $y_0 + \gamma_1 t$
from both sides of (14.04). This yields


$$y_t - y_0 - \gamma_1 t = \gamma_1 + y_{t-1} - y_0 - \gamma_1 t + e_t = y_{t-1} - y_0 - \gamma_1(t - 1) + e_t,$$


and it follows that $y_t$ can be generated by the equation $y_t = y_0 + \gamma_1 t + \sigma w_t$.
The trend term $\gamma_1 t$ is the drift in this process.


It is clear that, if we take first differences of the $y_t$ generated by a process like
(14.03) or (14.04), we obtain a time series that is I(0). In the latter case, for
example,


$$\Delta y_t \equiv y_t - y_{t-1} = \gamma_1 + e_t.$$


Thus we see that $y_t$ is integrated to order one, or I(1). This property is the
result of the fact that $y_t$ has a unit root.


The term “unit root” comes from the fact that the random walk process 
(14.03) can be expressed as 


$$(1 - L)y_t = e_t, \tag{14.05}$$


where $L$ denotes the lag operator. As we saw in Sections 7.6 and 13.2, an
autoregressive process $u_t$ always satisfies an equation of the form


$$\bigl(1 - \rho(L)\bigr)u_t = e_t, \tag{14.06}$$


where $\rho(L)$ is a polynomial in the lag operator $L$ with no constant term, and
$e_t$ is white noise. The process (14.06) is stationary if and only if all the roots
of the polynomial equation 1 — p(z) = 0 lie strictly outside the unit circle in 
the complex plane, that is, are greater than 1 in absolute value. A root that 
is equal to 1 is called a unit root. Any series that has precisely one such root, 
with all other roots outside the unit circle, is an I(1) process, as readers are 
asked to check in Exercise 14.2. 


A random walk process like (14.05) is a particularly simple example of an AR 
process with a unit root. A slightly more complicated example is 


$$y_t = (1 + \rho_2)\,y_{t-1} - \rho_2\,y_{t-2} + u_t, \qquad |\rho_2| < 1,$$

which is an AR(2) process with only one free parameter. In this case, the
polynomial in the lag operator is $1 - (1 + \rho_2)L + \rho_2 L^2 = (1 - L)(1 - \rho_2 L)$,
and its roots are 1 and $1/\rho_2$, the latter being greater than 1 in absolute value.
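The factorization can be verified numerically. The sketch below solves $1 - (1 + \rho_2)z + \rho_2 z^2 = 0$ with the quadratic formula; the helper name and the two values of $\rho_2$ are our own choices for illustration.

```python
import cmath

def lag_polynomial_roots(rho2):
    """Roots of 1 - (1 + rho2) z + rho2 z^2 = 0 via the quadratic formula."""
    if rho2 == 0:
        return [1.0 + 0j]  # the polynomial reduces to 1 - z
    a, b, c = rho2, -(1.0 + rho2), 1.0
    disc = cmath.sqrt(b * b - 4.0 * a * c)
    return [(-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)]

for rho2 in (0.5, -0.5):
    roots = lag_polynomial_roots(rho2)
    # One root is always 1 (the unit root); the other is 1 / rho2.
    print(f"rho2 = {rho2:+.1f}: roots = {roots}, moduli = {[abs(r) for r in roots]}")
```

For a negative $\rho_2$ the second root is negative, which is why the condition on it is best stated in terms of its absolute value.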
Same-Order Notation


Before we can discuss models in which one or more of the regressors has a 
unit root, it is necessary to introduce the concept of the same-order relation 
and its associated notation. Almost all of the quantities that we encounter in 
econometrics depend on the sample size. In many cases, when we are using 
asymptotic theory, the only thing about these quantities that concerns us is 
the rate at which they change as the sample size changes. The same-order 
relation provides a very convenient way to deal with such cases. 


To begin with, let us suppose that $f(n)$ is a real-valued function of the positive
integer $n$, and $p$ is a rational number. Then we say that $f(n)$ is of the same
order as $n^p$ if there exists a constant $K$, independent of $n$, and a positive
integer $N$ such that

$$\left| \frac{f(n)}{n^p} \right| < K \quad \text{for all } n > N.$$

When $f(n)$ is of the same order as $n^p$, we can write

$$f(n) = O(n^p).$$


Of course, this equation does not express an equality in the usual sense. But, 
as we will see in a moment, this “big O” notation is often very convenient. 


The definition we have just given is appropriate only if f(n) is a deterministic 
function. However, in most econometric applications, some or all of the quan- 
tities with which we are concerned are stochastic rather than deterministic. 
To deal with such quantities, we need to make use of the stochastic same- 
order relation. Let $\{a_n\}$ be a sequence of random variables indexed by the
positive integer $n$. Then we say that $a_n$ is of order $n^p$ in probability if, for all
$\varepsilon > 0$, there exist a constant $K$ and a positive integer $N$ such that

$$\Pr\!\left( \left| \frac{a_n}{n^p} \right| > K \right) < \varepsilon \quad \text{for all } n > N. \tag{14.07}$$

When $a_n$ is of order $n^p$ in probability, we can write

$$a_n = O_p(n^p).$$

In most cases, it is obvious that a quantity is stochastic, and there is no
harm in writing $O(n^p)$ when we really mean $O_p(n^p)$. The properties of the
same-order relations are the same in the deterministic and stochastic cases.


The same-order relations are useful because we can manipulate them as if 
they were simply powers of n. Suppose, for example, that we are dealing with 
two functions, $f(n)$ and $g(n)$, which are $O(n^p)$ and $O(n^q)$, respectively. Then

$$\begin{aligned}
f(n)\,g(n) &= O(n^p)\,O(n^q) = O(n^{p+q}), \text{ and} \\
f(n) + g(n) &= O(n^p) + O(n^q) = O(n^{\max(p,q)}).
\end{aligned} \tag{14.08}$$

In the first line here, we see that the order of the product of the two functions
is just n raised to the sum of p and q. In the second line, we see that the order 
of the sum of the functions is just n raised to the maximum of p and q. Both 
these properties of the same-order relations are often very useful in asymptotic 
analysis. 


Let us see how the same-order relations can be applied to a linear regression 
model that satisfies the standard assumptions for consistency and asymptotic 
normality. We start with the standard result, from equations (3.05), that 


$$\hat\beta = \beta_0 + (X^\top X)^{-1} X^\top u.$$

In Chapters 3 and 4, we made the assumption that $n^{-1}X^\top X$ has a probability
limit of $S_{X^\top X}$, which is a finite, positive definite, deterministic matrix; recall
equations (3.17) and (4.49). It follows readily from the definition (3.15) of a
probability limit that each element of the matrix $n^{-1}X^\top X$ is $O_p(1)$. Similarly,
in order to apply a central limit theorem, we supposed that $n^{-1/2}X^\top u$
has a probability limit which is a normally distributed random variable with
expectation zero and finite variance; recall equation (4.53). This implies that
$n^{-1/2}X^\top u = O_p(1)$.

The definition (14.07) lets us rewrite the above results as

$$X^\top X = O_p(n) \quad \text{and} \quad X^\top u = O_p(n^{1/2}). \tag{14.09}$$

From equations (14.09) and the first of equations (14.08), we see that

$$n^{1/2}(\hat\beta - \beta_0) = n^{1/2}(X^\top X)^{-1}X^\top u = n^{1/2}\,O_p(n^{-1})\,O_p(n^{1/2}) = O_p(1).$$

This result is not at all new; in fact, it follows from equation (6.38) specialized
to a linear regression. But it is clear that the $O_p$ notation provides a simple
way of seeing why we have to multiply $\hat\beta - \beta_0$ by $n^{1/2}$, rather than some other
power of $n$, in order to find its asymptotic distribution.


As this example illustrates, in the asymptotic analysis of econometric models 
for which all variables satisfy standard regularity conditions, p is generally 
$-1$, $-\tfrac{1}{2}$, 0, $\tfrac{1}{2}$, or 1. For models in which some or all variables have a unit
root, however, we will encounter several other values of p. 


Regressors with a Unit Root 


Whenever a variable with a unit root is used as a regressor in a linear regression 
model, the standard assumptions that we have made for asymptotic analysis 
are violated. In particular, we have assumed up to now that, for the linear 
regression model $y = X\beta + u$, the probability limit of the matrix $n^{-1}X^\top X$
is the finite, positive definite matrix $S_{X^\top X}$. But this assumption is false
whenever one or more of the regressors have a unit root.

To see this, consider the simplest case. Whenever $w_t$ is one of the regressors,
one element of $X^\top X$ is $\sum_{t=1}^{n} w_t^2$, which by equation (14.02) is equal to

$$\sum_{t=1}^{n} \Bigl( \sum_{r=1}^{t} \sum_{s=1}^{t} \varepsilon_r \varepsilon_s \Bigr). \tag{14.10}$$

The expectation of $\varepsilon_r \varepsilon_s$ is zero for $r \ne s$. Therefore, only terms with $r = s$
contribute to the expectation of (14.10), which, since $E(\varepsilon_r^2) = 1$, is

$$\sum_{t=1}^{n} \sum_{r=1}^{t} E(\varepsilon_r^2) = \sum_{t=1}^{n} t = \tfrac{1}{2}\,n(n + 1). \tag{14.11}$$

Here we have used a result concerning the sum of the first $n$ positive integers
that readers are asked to demonstrate in Exercise 14.3. Let $w$ denote
the $n$-vector with typical element $w_t$. Then the expectation of $n^{-1}w^\top w$ is
$(n + 1)/2$, which is evidently $O(n)$. It is therefore impossible that $n^{-1}w^\top w$
should have a finite probability limit.
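The expectation in (14.11) is easy to check by simulation. The following sketch averages $\sum_{t=1}^{n} w_t^2$ over many independent standardized random walks; the Gaussian innovations, the seed, and the choices of $n$ and the number of replications are assumptions made purely for illustration.

```python
import random

def sum_of_squares_of_walk(n, rng):
    """Return sum_{t=1}^n w_t^2 for one standardized random walk with w_0 = 0."""
    w, total = 0.0, 0.0
    for _ in range(n):
        w += rng.gauss(0.0, 1.0)   # w_t = w_{t-1} + eps_t, eps_t ~ N(0, 1) (assumed)
        total += w * w
    return total

rng = random.Random(12345)
n, reps = 50, 4000
average = sum(sum_of_squares_of_walk(n, rng) for _ in range(reps)) / reps
expected = n * (n + 1) / 2         # right-hand side of (14.11)

print(f"simulated mean of sum w_t^2: {average:.1f}")
print(f"n(n + 1)/2:                  {expected:.1f}")
```

Repeating the experiment with a larger $n$ shows the average growing like $n^2$, so that dividing by $n$ cannot stabilize it.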


This fact has extremely serious consequences for asymptotic analysis. It im- 
plies that none of the results on consistency and asymptotic normality that 
we have discussed up to now is applicable to models where one or more of the 
regressors have a unit root. All such results have been based on the assumption
that the matrix $n^{-1}X^\top X$, or the analogs of this matrix for nonlinear
regression models, models estimated by IV and GMM, and models estimated 
by maximum likelihood, tends to a finite, positive definite matrix. It is con- 
sequently very important to know whether or not an economic variable has 
a unit root. A few of the many techniques for answering this question will 
be discussed in the next section. In the next subsection, we investigate some 
of the phenomena that arise when the usual regularity conditions for linear 
regression models are not satisfied. 


Spurious Regressions 


If $x_t$ and $y_t$ are time series that are entirely independent of each other, we
might hope that running the simple linear regression

$$y_t = \beta_1 + \beta_2 x_t + v_t \tag{14.12}$$

would usually produce an insignificant estimate of $\beta_2$ and an $R^2$ near 0. However,
this is so only under quite restrictive conditions on the nature of the $x_t$
and $y_t$. In particular, if $x_t$ and $y_t$ are independent random walks, the $t$ statistic
for $\beta_2 = 0$ does not follow the Student's $t$ or standard normal distribution,
even asymptotically. Instead, its absolute value tends to become larger and
larger as the sample size $n$ increases. Ultimately, as $n \to \infty$, it rejects the
null hypothesis that $\beta_2 = 0$ with probability 1. Moreover, the $R^2$ does not
converge to 0 but to a random, positive number that varies from sample to
sample. When a regression model like (14.12) appears to find relationships
that do not really exist, it is called a spurious regression.

Figure 14.1 Rejection frequencies for spurious and valid regressions


We have not as yet developed the theory necessary to understand spurious 
regression with I(1) series. It is therefore worthwhile to illustrate the phe- 
nomenon with some computer simulations. For a large number of sample 
sizes between 20 and 20,000, we generated one million series of $(x_t, y_t)$ pairs
independently from the random walk model (14.03) and then ran the spurious
regression (14.12). The dotted line near the top in Figure 14.1 shows the proportion
of the time that the $t$ statistic for $\beta_2 = 0$ rejected the null hypothesis
at the .05 level as a function of $n$. This proportion is very high even for small
sample sizes, and it is clearly tending to unity as $n$ increases.
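A much smaller version of this experiment can be run in a few lines. The sketch below is not the code behind Figure 14.1; it is a scaled-down illustration with 500 replications at a single sample size, Gaussian errors, and the asymptotic two-tailed critical value 1.96, all of which are our own assumptions. For comparison, it also runs the same regression on independent white-noise series.

```python
import math
import random

def t_stat_slope(x, y):
    """OLS t statistic for beta_2 = 0 in y_t = beta_1 + beta_2 x_t + v_t."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    syy = sum((b - ybar) ** 2 for b in y)
    b2 = sxy / sxx
    ssr = syy - b2 * sxy                    # sum of squared residuals
    se = math.sqrt(ssr / (n - 2) / sxx)     # conventional OLS standard error
    return b2 / se

def simulate_series(n, rng, random_walk):
    """Return n observations of white noise, or of its cumulative sum."""
    level, out = 0.0, []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        level = level + e if random_walk else e
        out.append(level)
    return out

def rejection_rate(n, reps, random_walk, seed=0, crit=1.96):
    """Fraction of replications in which |t| exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = simulate_series(n, rng, random_walk)
        y = simulate_series(n, rng, random_walk)
        rejections += abs(t_stat_slope(x, y)) > crit
    return rejections / reps

rate_rw = rejection_rate(n=100, reps=500, random_walk=True, seed=1)
rate_wn = rejection_rate(n=100, reps=500, random_walk=False, seed=1)
print(f"independent random walks: {rate_rw:.3f}")
print(f"independent white noise:  {rate_wn:.3f}")
```

With white noise the rejection frequency should be close to the nominal .05, while with independent random walks it should be far higher, in line with the upper dotted line in the figure.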


Upon reflection, it is not entirely surprising that tests based on the spurious 
regression model (14.12) do not yield sensible results. Under the null hypothesis
that $\beta_2 = 0$, this model says that $y_t$ is equal to a constant plus an IID
error term. But in fact $y_t$ is a random walk generated by the DGP (14.03).
Thus the null hypothesis that we are testing is false, and it is very common 
for a test to reject a false null hypothesis, even when the alternative is also 
false. We saw an example of this in Section 7.9; for an advanced discussion, 
see Davidson and MacKinnon (1987). 


It might seem that we could obtain sensible results by running the regression 


$$y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + v_t, \tag{14.13}$$

since, if we set $\beta_1 = 0$, $\beta_2 = 0$, and $\beta_3 = 1$, regression (14.13) reduces to the
random walk (14.03), which is in fact the DGP for $y_t$ in our simulations, with
$v_t = e_t$ being white noise. Thus it is a valid regression model to estimate.
The lower dotted line in Figure 14.1 shows the proportion of the time that
the $t$ statistic for $\beta_2 = 0$ in regression (14.13) rejected the null hypothesis at
the .05 level. Although this proportion no longer tends to unity as $n$ increases,
it clearly tends to a number substantially larger than 0.05. This overrejection
is a consequence of running a regression that involves I(1) variables. Both
$y_t$ and $y_{t-1}$ are I(1) in this case, and, as we will see in Section 14.5, this
implies that the $t$ statistic for $\beta_2 = 0$ does not have its usual asymptotic
distribution, as one might suspect given that the $n^{-1}X^\top X$ matrix does not
have a finite plim.


The results in Figure 14.1 show clearly that spurious regressions actually 
involve at least two different phenomena. The first is that they involve testing 
false null hypotheses, and the second is that standard asymptotic results do 
not hold whenever at least one of the regressors is I(1), even when a model is 
correctly specified. 


As Granger (2001) has stressed, spurious regression can occur even when all 
variables are stationary. To illustrate this, Figure 14.1 also shows results of a 
second set of simulation experiments. These are similar to the original ones, 
except that $x_t$ and $y_t$ are now generated from independent AR(1) processes
with mean zero and autoregressive parameter $\rho_1 = 0.8$. The higher solid line
shows that, even for these data, which are stationary as well as independent, 
running the spurious regression (14.12) results in the null hypothesis being 
rejected a very substantial proportion of the time. In contrast to the previous 
results, however, this proportion does not keep increasing with the sample 
size. Moreover, as we see from the lower solid line, running the valid regres- 
sion (14.13) leads to approximately correct rejection frequencies, at least for 
larger sample sizes. Readers are invited to explore these issues further in 
Exercises 14.5 and 14.6. 


It is of interest to see just what gives rise to spurious regression with two 
independent AR(1) series that are stationary. In this case, the $n^{-1}X^\top X$
matrix does have a finite, deterministic, positive definite plim, and so that
regularity condition at least is satisfied. However, because neither the constant
nor $x_t$ has any explanatory power for $y_t$ in (14.12), the true error term for
observation $t$ is $v_t = y_t$, which is not white noise, but rather an AR(1) process.
This suggests that the problem can be made to go away if we do not use 
the inappropriate OLS covariance matrix estimator, but instead use a HAC 
estimator that takes suitable account of the serial correlation of the errors. 
This is true asymptotically, but overrejection remains very significant until 
the sample size is of the order of several thousand; see Exercise 14.7. The use 
of HAC estimators is explored further in Exercises 14.8 and 14.9. 


As the results in Figure 14.1 illustrate, there is a serious risk of appearing to 
find relationships between economic time series that are actually independent. 
Although the risk can be far from negligible with stationary series which ex- 
hibit substantial serial correlation, it is particularly severe with nonstationary 
ones. The phenomenon of spurious regressions was brought to the attention of
econometricians by Granger and Newbold (1974), who used simulation meth- 
ods that were very crude by today’s standards. Subsequently, Phillips (1986) 
and Durlauf and Phillips (1988) proved a number of theoretical results about 
spurious regressions involving nonstationary time series. Granger (2001) pro- 
vides a brief overview and survey of the literature. 


14.3 Unit Root Tests 


For a number of reasons, it can be important to know whether or not an econ- 
omic time series has a unit root. As Figure 14.1 illustrates, the distributions 
of estimators and test statistics associated with I(1) regressors may well dif- 
fer sharply from those associated with regressors that are I(0). Moreover, as 
Nelson and Plosser (1982) were among the first to point out, nonstationarity 
often has important economic implications. It is therefore very important to 
be able to detect the presence of unit roots in time series, normally by the use 
of what are called unit root tests. For these tests, the null hypothesis is that 
the time series has a unit root and the alternative is that it is I(0). 


Dickey-Fuller Tests 


The simplest and most widely-used tests for unit roots are variants of ones 
developed by Dickey and Fuller (1979). These tests are therefore referred to 
as Dickey-Fuller tests, or DF tests. Consider the simplest imaginable AR(1) 
model, 

$$y_t = \beta y_{t-1} + \sigma \varepsilon_t, \tag{14.14}$$

where $\varepsilon_t$ is white noise with variance 1. When $\beta = 1$, this model has a unit
root and becomes a random walk process. If we subtract $y_{t-1}$ from both sides,
we obtain

$$\Delta y_t = (\beta - 1)\,y_{t-1} + \sigma \varepsilon_t. \tag{14.15}$$

Thus, in order to test the null hypothesis of a unit root, we can simply test
the hypothesis that the coefficient of $y_{t-1}$ in equation (14.15) is equal to 0
against the alternative that it is negative.


Regression (14.15) is an example of what is sometimes called an unbalanced 
regression because, under the null hypothesis, the regressand is I(0) and the 
sole regressor is I(1). Under the alternative hypothesis, both variables are 
I(0), and the regression becomes balanced again.


The obvious way to test the unit root hypothesis is to use the t statistic for 
the hypothesis $\beta - 1 = 0$ in regression (14.15), testing against the alternative
that this quantity is negative. This implies a one-tailed test. In fact, this
statistic is referred to, not as a $t$ statistic, but as a $\tau$ statistic, because, as we
will see, its distribution is not the same as that of an ordinary $t$ statistic, even
asymptotically. Another possible test statistic is $n$ times the OLS estimate
of $\beta - 1$ from (14.15). This statistic is called a $z$ statistic. Precisely why the $z$
statistic is valid will become clear in the next subsection. Since the $z$ statistic
is a little easier to analyze than the $\tau$ statistic, we focus on it for the moment.
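To make the two statistics concrete, here is a minimal sketch that computes both of them from the no-constant test regression (14.15) by ordinary least squares. The function names, the Gaussian errors, and the two simulated series (one with $\beta = 1$, one with $\beta = 0.5$) are illustrative assumptions only.

```python
import math
import random

def df_statistics_nc(y):
    """z and tau statistics from regression (14.15); y = [y_0, y_1, ..., y_n]."""
    n = len(y) - 1
    lag = y[:-1]                                        # y_{t-1} for t = 1..n
    dy = [y[t] - y[t - 1] for t in range(1, n + 1)]     # Delta y_t
    s_ll = sum(v * v for v in lag)
    coef = sum(l * d for l, d in zip(lag, dy)) / s_ll   # OLS estimate of beta - 1
    ssr = sum((d - coef * l) ** 2 for l, d in zip(lag, dy))
    se = math.sqrt(ssr / (n - 1) / s_ll)                # usual OLS standard error
    return n * coef, coef / se                          # (z, tau)

def ar1(n, beta, rng, y0=0.0):
    """Simulate y_t = beta * y_{t-1} + eps_t with standard normal eps_t (assumed)."""
    y = [y0]
    for _ in range(n):
        y.append(beta * y[-1] + rng.gauss(0.0, 1.0))
    return y

rng = random.Random(7)
z_unit, tau_unit = df_statistics_nc(ar1(500, 1.0, rng))   # unit root DGP
z_stat, tau_stat = df_statistics_nc(ar1(500, 0.5, rng))   # stationary DGP

print(f"beta = 1.0: z = {z_unit:8.2f}, tau = {tau_unit:6.2f}")
print(f"beta = 0.5: z = {z_stat:8.2f}, tau = {tau_stat:6.2f}")
```

For the stationary series both statistics should be very large and negative. Whether the values for the unit root series are unusual cannot be judged against the normal or Student's $t$ distributions, for the reasons explained in the text.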


The z statistic from the test regression (14.15) is 


$$z = n\,\frac{\sum_{t=1}^{n} y_{t-1}\,\Delta y_t}{\sum_{t=1}^{n} y_{t-1}^2},$$

where, for ease of notation in summations, we suppose that $y_0$ is observed.
Under the null hypothesis, the data are generated by a DGP of the form

$$y_t = y_{t-1} + \sigma \varepsilon_t, \tag{14.16}$$

or, equivalently, $y_t = y_0 + \sigma w_t$, where $w_t$ is a standardized random walk
defined in terms of $\varepsilon_t$ by (14.01). For such a DGP, a little algebra shows that
the $z$ statistic becomes

$$z = n\,\frac{\sigma^2 \sum_{t=1}^{n} w_{t-1}\varepsilon_t + \sigma y_0 w_n}{\sigma^2 \sum_{t=1}^{n} w_{t-1}^2 + 2 y_0 \sigma \sum_{t=1}^{n} w_{t-1} + n y_0^2}. \tag{14.17}$$

Since the right-hand side of this equation depends on $y_0$ and $\sigma$ in a nontrivial
manner, the $z$ statistic is not pivotal for the model (14.16). However, when
$y_0 = 0$, $z$ no longer depends on $\sigma$, and it becomes a function of the random
walk $w_t$ alone. In this special case, the distribution of $z$ can be calculated,
perhaps analytically and certainly by simulation, provided we know the distribution
of the $\varepsilon_t$.


In most cases, we do not wish to assume that yo = 0. Therefore, we must look 
further for a suitable test statistic. Subtracting $y_0$ from both $y_t$ and $y_{t-1}$ in
equation (14.14) gives

$$\Delta y_t = (1 - \beta)\,y_0 + (\beta - 1)\,y_{t-1} + \sigma \varepsilon_t.$$

Unlike (14.15), this regression has a constant term. This suggests that we
should replace (14.15) by the test regression

$$\Delta y_t = \gamma_0 + (\beta - 1)\,y_{t-1} + e_t. \tag{14.18}$$

Since $y_t = y_0 + \sigma w_t$, we may write $y = y_0 \iota + \sigma w$, where the notation should be
obvious. The $z$ statistic from (14.18) is still $n(\hat\beta - 1)$, and so, by application
of the FWL theorem, it can be written under the null as

$$n\,\frac{\sum_{t=1}^{n} (M_\iota y)_{t-1}\,\Delta y_t}{\sum_{t=1}^{n} (M_\iota y)_{t-1}^2} = n\,\frac{\sigma \sum_{t=1}^{n} (M_\iota y)_{t-1}\,\varepsilon_t}{\sum_{t=1}^{n} (M_\iota y)_{t-1}^2}, \tag{14.19}$$

where $M_\iota$ is the orthogonal projection that replaces a series by its deviations
from the mean. Since $M_\iota y = \sigma M_\iota w$, it follows that

$$z = n\,\frac{\sum_{t=1}^{n} (M_\iota w)_{t-1}\,\varepsilon_t}{\sum_{t=1}^{n} (M_\iota w)_{t-1}^2}, \tag{14.20}$$

where a factor of $\sigma^2$ has been cancelled from the numerator and denominator.
Since the $w_t$ are determined by the $\varepsilon_t$, the new statistic depends only on the
series $\varepsilon_t$, and so it is pivotal for the model (14.16).
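The practical meaning of pivotalness can be seen by holding the innovations fixed and changing $y_0$ and $\sigma$. In the sketch below, which uses function names and parameter values of our own choosing, the statistic from the regression with a constant is computed by demeaning the lagged level, as the FWL theorem allows, and compared with the statistic from the regression without a constant.

```python
import random

def z_nc(y):
    """n * (beta_hat - 1) from regression (14.15), which has no constant."""
    n = len(y) - 1
    lag = y[:-1]
    dy = [y[t] - y[t - 1] for t in range(1, n + 1)]
    return n * sum(l * d for l, d in zip(lag, dy)) / sum(l * l for l in lag)

def z_c(y):
    """n * (beta_hat - 1) from regression (14.18), computed via FWL:
    the lagged level is replaced by its deviations from its mean."""
    n = len(y) - 1
    lag = y[:-1]
    mean_lag = sum(lag) / n
    dev = [l - mean_lag for l in lag]
    dy = [y[t] - y[t - 1] for t in range(1, n + 1)]
    return n * sum(v * d for v, d in zip(dev, dy)) / sum(v * v for v in dev)

# One fixed draw of the innovations and the standardized random walk w_t.
rng = random.Random(2024)
n = 200
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
w = [0.0]
for e in eps:
    w.append(w[-1] + e)

def make_y(y0, sigma):
    """y_t = y0 + sigma * w_t, the DGP (14.16) written in levels."""
    return [y0 + sigma * wt for wt in w]

z_c_a, z_c_b = z_c(make_y(0.0, 1.0)), z_c(make_y(100.0, 5.0))
z_nc_a, z_nc_b = z_nc(make_y(0.0, 1.0)), z_nc(make_y(100.0, 5.0))

print(f"with constant:    {z_c_a:.6f} vs {z_c_b:.6f}")
print(f"without constant: {z_nc_a:.6f} vs {z_nc_b:.6f}")
```

The first pair of numbers should agree up to rounding error, while the second pair should differ, which is exactly the distinction between (14.20) and (14.17).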


If we wish to test the unit root hypothesis in a model where the random walk 
has a drift, the appropriate test regression is 


$$\Delta y_t = \gamma_0 + \gamma_1 t + (\beta - 1)\,y_{t-1} + e_t, \tag{14.21}$$

and if we wish to test the unit root hypothesis in a model where the random
walk has both a drift and a trend, the appropriate test regression is

$$\Delta y_t = \gamma_0 + \gamma_1 t + \gamma_2 t^2 + (\beta - 1)\,y_{t-1} + e_t; \tag{14.22}$$


see Exercise 14.10. Notice that regression (14.15) contains no deterministic 
regressors, (14.18) has one, (14.21) two, and (14.22) three. In the last three 
cases, the test regression always contains one deterministic regressor that does 
not appear under the null hypothesis. 


Dickey-Fuller tests of the null hypothesis that there is a unit root may be 
based on any of regressions (14.15), (14.18), (14.21), or (14.22). In practice, 
regressions (14.18) and (14.21) are the most commonly used. The assumptions 
required for regression (14.15) to yield a valid test are usually considered to 
be too strong, while those that lead to regression (14.22) are often considered 
to be unnecessarily weak. 


The $z$ and $\tau$ statistics based on the testing regression (14.15) are denoted as
$z_{nc}$ and $\tau_{nc}$, respectively. The subscript “nc” indicates that (14.15) has no
constant term. Similarly, $z$ statistics based on regressions (14.18), (14.21),
and (14.22) are written as $z_c$, $z_{ct}$, and $z_{ctt}$, respectively, because these test
regressions contain a constant, a constant and a trend, or a constant and
two trends, respectively. A similar notation is used for the $\tau$ statistics. It is
important to note that all eight of these statistics have different distributions, 
both in finite samples and asymptotically, even under their corresponding null 
hypotheses. 


The standard test statistics for $\gamma_1 = 0$ in regression (14.21) and for $\gamma_2 = 0$
or $\gamma_1 = \gamma_2 = 0$ in regression (14.22) do not have their usual asymptotic
distributions under the null hypothesis of a unit root; see Dickey and Fuller 
(1981). Therefore, instead of formally testing whether the coefficients of t and 
t? are equal to 0, many authors simply report the results of more than one 
unit root test. 
Asymptotic Distributions of Dickey-Fuller Statistics

The eight Dickey-Fuller test statistics that we have discussed have distribu- 
tions that tend to eight different asymptotic distributions as the sample size 
tends to infinity. These asymptotic distributions are referred to as nonstan- 
dard distributions or as Dickey-Fuller distributions. 

We will analyze only the simplest case, that of the $z_{nc}$ statistic, which is
applicable only for the model (14.16) with $y_0 = 0$. For DGPs in that model,
the test statistic (14.17) simplifies to

$$z_{nc} = n\,\frac{\sum_{t=1}^{n} w_{t-1}\varepsilon_t}{\sum_{t=1}^{n-1} w_t^2}. \tag{14.23}$$

We begin by considering the numerator of this expression. By (14.02), we
have that

$$\sum_{t=1}^{n} w_{t-1}\varepsilon_t = \sum_{t=1}^{n} \varepsilon_t \sum_{s=1}^{t-1} \varepsilon_s. \tag{14.24}$$

Since $E(\varepsilon_t \varepsilon_s) = 0$ for $s < t$, it is clear that the expectation of this quantity
is zero. The right-hand side of (14.24) has $\sum_{t=1}^{n} (t - 1) = n(n - 1)/2$ terms;
recall the result used in (14.11). It is easy to see that the covariance of any
two different terms of the double sum is zero, while the variance of each term
is just 1. Consequently, the variance of (14.24) is $n(n - 1)/2$. The variance
of (14.24) divided by $n$ is therefore $(1 - 1/n)/2$, which tends to one half as
$n \to \infty$. We conclude that $n^{-1}$ times (14.24) is $O(1)$ as $n \to \infty$.

We saw in the last section, in equation (14.11), that the expectation of
$\sum_{t=1}^{n} w_t^2$ is $n(n + 1)/2$. Thus the expectation of the denominator of (14.23)
is $n(n - 1)/2$, since the last term of the sum is missing. It can be checked by
a somewhat longer calculation (see Exercise 14.11) that the variance of the
denominator is $O(n^4)$ as $n \to \infty$, and so both the expectation and variance of
the denominator divided by $n^2$ are $O(1)$. We may therefore write (14.23) as

$$z_{nc} = \frac{n^{-1} \sum_{t=1}^{n} w_{t-1}\varepsilon_t}{n^{-2} \sum_{t=1}^{n-1} w_t^2}, \tag{14.25}$$

where everything is of order unity. This explains why $\hat\beta - 1$ is multiplied by $n$,
rather than by $n^{1/2}$ or some other power of $n$, to obtain the $z$ statistic.


In order to have convenient expressions for the probability limits of the ran- 
dom variables in the numerator and denominator of expression (14.25), we 
can make use of a continuous-time stochastic process called the standardized 
Wiener process, or sometimes Brownian motion. This process, denoted $W(r)$
for $0 \le r \le 1$, can be interpreted as the limit of the standardized random
walk $w_t$ as the length of each interval becomes infinitesimally small. It is
defined as

$$W(r) \equiv \operatorname*{plim}_{n \to \infty} n^{-1/2} w_{[rn]} = \operatorname*{plim}_{n \to \infty} n^{-1/2} \sum_{t=1}^{[rn]} \varepsilon_t, \tag{14.26}$$

where $[rn]$ means the integer part of the quantity $rn$, which is a number between
0 and $n$. Intuitively, a Wiener process is like a continuous random walk
defined on the 0-1 interval. Even though it is continuous, it varies erratically
on any subinterval. Since $\varepsilon_t$ is white noise, it follows from the central
limit theorem that $W(r)$ is normally distributed for each $r \in [0, 1]$. Clearly,
$E(W(r)) = 0$, and, since $\mathrm{Var}(w_t) = t$, it can be seen that $\mathrm{Var}(W(r)) = r$.
Thus $W(r)$ follows the $N(0, r)$ distribution. For further properties of the
Wiener process, see Exercise 14.12.


We can now express the limit as n — oo of the numerator of the right-hand 
side of equation (14.25) in terms of the Wiener process W(r). Note first that, 
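since $w_{t+1} - w_t = \varepsilon_{t+1}$,

$$\sum_{t=1}^{n} w_t^2 = \sum_{t=0}^{n-1} \bigl( w_t + (w_{t+1} - w_t) \bigr)^2 = \sum_{t=0}^{n-1} w_t^2 + 2 \sum_{t=0}^{n-1} w_t \varepsilon_{t+1} + \sum_{t=0}^{n-1} \varepsilon_{t+1}^2.$$

Since $w_0 = 0$, the term on the left-hand side above is the same as the first
term of the rightmost expression, except for the term $w_n^2$. Thus we find that

$$\sum_{t=0}^{n-1} w_t \varepsilon_{t+1} = \sum_{t=1}^{n} w_{t-1}\varepsilon_t = \frac{1}{2} \Bigl( w_n^2 - \sum_{t=1}^{n} \varepsilon_t^2 \Bigr).$$

Dividing by $n$ and taking the limit as $n \to \infty$ gives

$$\operatorname*{plim}_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} w_{t-1}\varepsilon_t = \frac{1}{2} \bigl( W^2(1) - 1 \bigr), \tag{14.27}$$

where we have used the law of large numbers to see that $\operatorname{plim} n^{-1} \sum_{t=1}^{n} \varepsilon_t^2 = 1$.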
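The identity that underlies (14.27) holds exactly in every sample, not just in the limit, and this is easily confirmed numerically. The sketch below builds one standardized random walk and evaluates both sides; the sample size, the seed, and the Gaussian innovations are arbitrary illustrative choices.

```python
import random

rng = random.Random(99)
n = 1000
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]   # eps[t-1] holds eps_t, t = 1..n

w = [0.0]                                       # w_0 = 0
for e in eps:
    w.append(w[-1] + e)                         # w_t = w_{t-1} + eps_t

# Left-hand side: sum over t = 1..n of w_{t-1} * eps_t.
lhs = sum(w[t - 1] * eps[t - 1] for t in range(1, n + 1))

# Right-hand side: (1/2) * (w_n^2 - sum of eps_t^2).
rhs = 0.5 * (w[n] ** 2 - sum(e * e for e in eps))

print(f"sum of w_(t-1) eps_t:      {lhs:.8f}")
print(f"(w_n^2 - sum eps_t^2) / 2: {rhs:.8f}")
```

Because the two sides agree for any realization, all of the randomness in the limit (14.27) comes from $w_n^2/n$, whose limit is $W^2(1)$.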


For the denominator of the right-hand side of equation (14.25), we see that 


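$$n^{-2} \sum_{t=1}^{n-1} w_t^2 \stackrel{a}{=} \frac{1}{n} \sum_{t=1}^{n-1} W^2\!\Bigl(\frac{t}{n}\Bigr).$$

If $f$ is an ordinary nonrandom function defined on [0,1], the Riemann integral
of $f$ on that interval can be defined as the following limit:

$$\int_0^1 f(x)\,dx \equiv \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} f\Bigl(\frac{t}{n}\Bigr). \tag{14.28}$$

It turns out to be possible to extend this definition to random integrands in
a natural way. We may therefore write

$$\operatorname*{plim}_{n \to \infty} n^{-2} \sum_{t=1}^{n-1} w_t^2 = \int_0^1 W^2(r)\,dr,$$

which, combined with equation (14.27), gives

$$\operatorname*{plim}_{n \to \infty} z_{nc} = \frac{\tfrac{1}{2}\bigl(W^2(1) - 1\bigr)}{\int_0^1 W^2(r)\,dr}. \tag{14.29}$$

A similar calculation (see Exercise 14.13) shows that

$$\operatorname*{plim}_{n \to \infty} \tau_{nc} = \frac{\tfrac{1}{2}\bigl(W^2(1) - 1\bigr)}{\bigl(\int_0^1 W^2(r)\,dr\bigr)^{1/2}}. \tag{14.30}$$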


More formal proofs of these results can be found in many places, including 
Banerjee, Dolado, Galbraith, and Hendry (1993, Chapter 4), Hamilton (1994, 
Chapter 17), Fuller (1996), Hayashi (2000, Chapter 9), and Bierens (2001). 


Results for the other six test statistics are more complicated. For $z_c$ and $\tau_c$,
the limiting random variables can be expressed in terms of a centered Wiener
process. Similarly, for $z_{ct}$ and $\tau_{ct}$, one needs a Wiener process that has been
centered and detrended, and so on. For details, see Phillips and Perron (1988)
and Bierens (2001). Exercise 14.14 looks in more detail at the limit of $z_c$.


Unfortunately, although the quantities (14.29) and (14.30) and their analogs 
for the other test statistics have well-defined distributions, there are no simple, 
analytical expressions for them.² In practice, therefore, these distributions
are always evaluated by simulation methods. Published critical values are 
based on a very large number of simulations of either the actual test statistics 
or of quantities, based on simulated random walks, that approximate the 
expressions to which the statistics converge asymptotically under the null 
hypothesis. For example, in the case of (14.30), the quantity to which $\tau_{nc}$
tends asymptotically, such an approximation is given by

$$\frac{\tfrac{1}{2}\bigl(n^{-1} w_n^2 - 1\bigr)}{\bigl(n^{-2} \sum_{t=1}^{n-1} w_t^2\bigr)^{1/2}},$$

where the $w_t$ are generated by the standardized random walk process (14.01).
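A rough version of such a simulation is sketched below. It draws the approximating quantity, with $W(1)$ replaced by $n^{-1/2} w_n$ and the integral by $n^{-2}\sum w_t^2$, for simulated random walks and reads off a crude empirical .05 quantile. The walk length, the number of replications, the Gaussian innovations, and the simple order-statistic quantile are all assumptions made for brevity; published critical values rest on far larger experiments and more refined methods.

```python
import math
import random

def tau_nc_approx(n, rng):
    """One draw of (1/2)(w_n^2 / n - 1) / (n^{-2} sum_{t=1}^{n-1} w_t^2)^{1/2}."""
    w, sum_sq = 0.0, 0.0
    for t in range(1, n + 1):
        w += rng.gauss(0.0, 1.0)      # standardized random walk (14.01)
        if t < n:
            sum_sq += w * w           # accumulate w_t^2 for t = 1..n-1
    return 0.5 * (w * w / n - 1.0) / math.sqrt(sum_sq / n ** 2)

rng = random.Random(314159)
n, reps = 200, 2000
draws = sorted(tau_nc_approx(n, rng) for _ in range(reps))

crit_05 = draws[int(0.05 * reps)]                       # crude empirical 5% quantile
share_below_normal = sum(d < -1.645 for d in draws) / reps

print(f"simulated .05 critical value: {crit_05:.2f}")
print(f"share of draws below -1.645:  {share_below_normal:.3f}")
```

Standard tables report an asymptotic .05 critical value of roughly $-1.94$ for this statistic, a figure that is not quoted in the text above; a simulation of this size should land in that neighborhood, but only to within a few hundredths.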


Various critical values for unit root and related tests have been reported in 
the literature. Not all of these are particularly accurate. Some authors fail to 
use a sufficiently large number of replications, and many report results based 
on a single finite value of n instead of using more sophisticated techniques 
in order to estimate the asymptotic distributions of interest. See MacKinnon 
(1991, 1994, 1996). The last of these papers probably gives the most accurate 
estimates of Dickey-Fuller distributions that have been published. It also 
provides programs, which are freely available, that make it easy to calculate 
critical values and P values for all of the test statistics discussed here. 


² Abadir (1995) does provide an analytical expression for the distribution of $\tau_{nc}$,
but it is certainly not simple.

Figure 14.2 Asymptotic densities of Dickey-Fuller $\tau$ tests


The asymptotic densities of the $\tau_{nc}$, $\tau_c$, $\tau_{ct}$, and $\tau_{ctt}$ statistics are shown in
Figure 14.2. For purposes of comparison, the standard normal density is also
shown. The differences between it and the four Dickey-Fuller $\tau$ distributions
are striking. The critical values for one-tail tests at the .05 level based on the
Dickey-Fuller distributions are also marked on the figure. These critical values
become more negative as the number of deterministic regressors in the test
regression increases. For the standard normal distribution, the corresponding
critical value would be $-1.645$.


The asymptotic densities of the $z_{nc}$, $z_c$, $z_{ct}$, and $z_{ctt}$ statistics are shown
in Figure 14.3. These are much more spread out than the densities of the
corresponding $\tau$ statistics, and the critical values are much larger in absolute
value. Once again, these critical values become more negative as the number
of deterministic regressors in the test regression increases. Since the test
statistics are equal to $n(\hat\beta - 1)$, it is easy to see how these critical values
are related to $\hat\beta$ for any given sample size. For example, when $n = 100$, the
$z_c$ test rejects the null hypothesis of a unit root whenever $\hat\beta < 0.859$, and the
$z_{ct}$ test rejects the null whenever $\hat\beta < 0.783$. Evidently, these tests have little
power if the data are actually generated by a stationary AR(1) process with $\beta$
reasonably close to unity.
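The arithmetic behind these thresholds is simply $\hat\beta = 1 + c/n$, where $c$ is the critical value. In the sketch below, the two critical values are the ones implied by the thresholds just quoted for $n = 100$, and the other sample sizes are our own additions; applying asymptotic critical values at small $n$ is itself only an approximation.

```python
def beta_hat_threshold(critical_value, n):
    """Value of beta_hat at which z = n * (beta_hat - 1) equals the critical value."""
    return 1.0 + critical_value / n

# Critical values implied by the thresholds quoted in the text for n = 100:
# 100 * (0.859 - 1) = -14.1 and 100 * (0.783 - 1) = -21.7.
implied = {"z_c": -14.1, "z_ct": -21.7}

for name, crit in implied.items():
    for n in (25, 100, 400):
        threshold = beta_hat_threshold(crit, n)
        print(f"{name:4s} n = {n:3d}: reject when beta_hat < {threshold:.3f}")
```

The thresholds move toward 1 only at rate $1/n$, so in samples of the size common in macroeconomic work an estimate such as $\hat\beta = 0.9$ would not lead to rejection by either test.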


Of course, the finite-sample distributions of Dickey-Fuller test statistics are 
not the same as their asymptotic distributions, although the latter generally 
provide reasonable approximations for samples of moderate size. The programs
in MacKinnon (1996) actually provide finite-sample critical values and
P values as well as asymptotic ones, but only under the strong assumptions
that the error terms are normally and identically distributed. Neither of these
assumptions is required for the asymptotic distributions to be valid. However,
the assumption that the error terms are serially independent, which is often
not at all plausible in practice, is required.

Figure 14.3 Asymptotic densities of Dickey-Fuller $z$ tests


14.4 Serial Correlation and Unit Root Tests 


Because the unit root test regressions (14.15), (14.18), (14.21), and (14.22) 
do not include any economic variables beyond $y_{t-1}$, the error terms $u_t$ may
well be serially correlated. This very often seems to be the case in practice. 
But this means that the Dickey-Fuller tests we have described are no longer 
asymptotically valid. A good many ways of modifying the tests have been 
proposed in order to make them valid in the presence of serial correlation 
of unknown form. The most popular approach is to use what are called 
augmented Dickey-Fuller, or ADF, tests. They were proposed originally by 
Dickey and Fuller (1979) under the assumption that the error terms follow an 
AR process of known order. Subsequent work by Said and Dickey (1984) and 
Phillips and Perron (1988) showed that they are asymptotically valid under 
much less restrictive assumptions. 


Consider the test regressions (14.15), (14.18), (14.21), or (14.22). We can 
write any of these regressions as 


$$\Delta y_t = X_t \gamma^\circ + (\beta - 1)\,y_{t-1} + u_t, \tag{14.31}$$

where $X_t$ is a row vector that consists of whatever deterministic regressors
are included in the test regression. Now suppose, for simplicity, that the error
term $u_t$ in (14.31) follows the stationary AR(1) process $u_t = \rho_1 u_{t-1} + e_t$,
where $e_t$ is white noise. Then regression (14.31) would become

$$\begin{aligned}
\Delta y_t &= X_t \gamma^\circ - \rho_1 X_{t-1} \gamma^\circ + (\rho_1 + \beta - 1)\,y_{t-1} - \beta\rho_1\,y_{t-2} + e_t \\
&= X_t \gamma + (\rho_1 + \beta - 1 - \beta\rho_1)\,y_{t-1} + \beta\rho_1\,(y_{t-1} - y_{t-2}) + e_t \\
&= X_t \gamma + (\beta - 1)(1 - \rho_1)\,y_{t-1} + \beta\rho_1\,\Delta y_{t-1} + e_t \\
&\equiv X_t \gamma + \beta' y_{t-1} + \delta_1 \Delta y_{t-1} + e_t.
\end{aligned} \tag{14.32}$$

We are able to replace $X_t \gamma^\circ - \rho_1 X_{t-1} \gamma^\circ$ by $X_t \gamma$ in the second line here,
for some choice of $\gamma$, because every column of $X_{t-1}$ lies in $\mathcal{S}(X)$. This is
a consequence of the fact that $X_t$ can include only deterministic variables
such as a constant, a linear trend, and so on. Each element of $\gamma$ is a linear
combination of the elements of $\gamma^\circ$. Expression (14.32) is just the regression
function of (14.31), with one additional regressor, namely, $\Delta y_{t-1}$. Adding
this regressor has caused the serially dependent error term $u_t$ to be replaced
by the white-noise error term $e_t$.


The ADF version of the $\tau$ statistic is simply the ordinary t statistic for the 
hypothesis that the coefficient $\beta'$ on $y_{t-1}$ in (14.32) is zero. If the serial 
correlation in the error 
terms were fully accounted for by an AR(1) process, it turns out that this 
statistic would have exactly the same asymptotic distribution as the ordinary 
$\tau$ statistic for the same specification of $X_t$. The fact that $\beta'$ is equal to 
$(\beta - 1)(1 - \rho_1)$ rather than $\beta - 1$ does not matter. Because it is assumed that 
$|\rho_1| < 1$, this coefficient can be zero only if $\beta = 1$. Thus a test for $\beta' = 0$ in 
regression (14.32) is equivalent to a test for $\beta = 1$. 
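

The algebra that leads to (14.32) can be checked numerically. The following 
is only a minimal sketch, under the assumptions that $X_t$ is empty and that 
$y_0 = u_0 = 0$; the function names and parameter values are ours and purely 
illustrative. It simulates the model with AR(1) errors and confirms that, at 
the true parameters, the residuals of the augmented form coincide with the 
white-noise innovations $\varepsilon_t$. 

```python
import random

def simulate_ar1_error_model(n, beta, rho1, seed=0):
    """Simulate y_t = beta*y_{t-1} + u_t with u_t = rho1*u_{t-1} + eps_t.

    X_t is taken to be empty and y_0 = u_0 = 0 (assumptions of this sketch).
    Returns the lists y[0..n] and eps[0..n]; index 0 holds the start values.
    """
    rng = random.Random(seed)
    y, u, eps = [0.0], [0.0], [0.0]
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        u_t = rho1 * u[-1] + e
        y.append(beta * y[-1] + u_t)
        u.append(u_t)
        eps.append(e)
    return y, eps

def augmented_residuals(y, beta, rho1):
    """Residuals of the augmented form (14.32) evaluated at the true parameters."""
    beta_prime = (beta - 1.0) * (1.0 - rho1)   # coefficient on y_{t-1}
    delta1 = beta * rho1                        # coefficient on Delta y_{t-1}
    res = []
    for t in range(2, len(y)):
        dy_t = y[t] - y[t - 1]
        dy_lag = y[t - 1] - y[t - 2]
        res.append(dy_t - beta_prime * y[t - 1] - delta1 * dy_lag)
    return res

y, eps = simulate_ar1_error_model(n=200, beta=0.9, rho1=0.5)
res = augmented_residuals(y, beta=0.9, rho1=0.5)
max_gap = max(abs(r - e) for r, e in zip(res, eps[2:]))
print(max_gap)  # tiny: the augmented residuals reproduce eps_t
```

Setting beta to 1 in the same functions makes the coefficient on $y_{t-1}$ 
exactly zero, which is the sense in which a test for $\beta' = 0$ is a test 
for $\beta = 1$. 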


It is very easy to compute ADF $\tau$ statistics using regressions like (14.32), but 
it is not quite so easy to compute the corresponding z statistics. If $\hat\beta'$ were 
multiplied by n, the result would be $n(\hat\beta - 1)(1 - \hat\rho_1)$ rather than $n(\hat\beta - 1)$. 
The former statistic clearly would not have the same asymptotic distribution 
as the latter. To avoid this problem, we need to divide by $1 - \hat\rho_1$. Thus, a 
valid ADF z statistic based on regression (14.32) is $n\hat\beta'/(1 - \hat\rho_1)$. Under 
the null hypothesis, $\delta_1 = \beta\rho_1$ is equal to $\rho_1$, so that $\hat\delta_1$ can serve 
as the estimate $\hat\rho_1$. 


In this simple example, we were able to handle serial correlation by adding 
a single regressor, $\Delta y_{t-1}$, to the test regression. It is easy to see that, if $u_t$ 
followed an AR(p) process, we would have to add p additional regressors, 
namely, $\Delta y_{t-1}$, $\Delta y_{t-2}$, and so on up to $\Delta y_{t-p}$. But if the error terms followed 
a moving average process, or a process with a moving average component, it 
might seem that we would have to add an infinite number of lagged values 
of $\Delta y_t$ in order to model them. However, we do not have to do anything so 
extreme. As Said and Dickey (1984) showed, we can validly use ADF tests 
even when there is a moving average component in the errors, provided we let 
the number of lags of $\Delta y_t$ that are included tend to infinity at an appropriate 
rate, which turns out to be a rate slower than $n^{1/3}$. See Galbraith and 
Zinde-Walsh (1999). This is a consequence of the fact that every moving average 
and ARMA process has an AR($\infty$) representation; see Section 13.2. 


To summarize, provided the number of lags p is chosen appropriately, we can 
always base both types of ADF test on the regression 


$$\Delta y_t = X_t\gamma + \beta' y_{t-1} + \sum_{j=1}^{p} \delta_j \Delta y_{t-j} + e_t, \tag{14.33}$$


where $X_t$ is a row vector of deterministic regressors, and $\beta'$ and the $\delta_j$ are 
functions of $\beta$ and the p coefficients in the AR(p) representation of the process 
for the error terms. The $\tau$ statistic is just the ordinary t statistic for $\beta' = 0$, 
and the z statistic is 


$$z = \frac{n\hat\beta'}{1 - \sum_{j=1}^{p} \hat\delta_j}. \tag{14.34}$$


Under the null hypothesis of a unit root, and for a suitable choice of p (which 
must increase with n), the asymptotic distributions of both z and $\tau$ statistics 
are the same as those of ordinary Dickey-Fuller statistics for the same set 
of regressors $X_t$. Because a general proof of this result is cumbersome, it is 
omitted, but an important part of the proof is treated in Exercise 14.16. 
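

Regression (14.33) and the two statistics can be computed along the following 
lines. This is a sketch rather than a definitive implementation: it assumes 
that $X_t$ contains at most a constant, it uses only the Python standard 
library (the small `ols` helper is ours and solves the normal equations 
directly), and it takes n in (14.34) to be the number of observations actually 
used in the regression. 

```python
import math
import random

def ols(X, y):
    """OLS via Gauss-Jordan on [X'X | X'y | I]; returns (coefficients, std errors)."""
    n, k = len(X), len(X[0])
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)]
         + [sum(X[t][i] * y[t] for t in range(n))]
         + [1.0 if i == j else 0.0 for j in range(k)]
         for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    b = [A[i][k] for i in range(k)]                       # coefficient estimates
    resid = [y[t] - sum(X[t][i] * b[i] for i in range(k)) for t in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)              # error variance estimate
    se = [math.sqrt(s2 * A[i][k + 1 + i]) for i in range(k)]
    return b, se

def adf_statistics(y, p, constant=True):
    """tau and z statistics from regression (14.33) with p lags of Delta y."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]      # dy[t-1] = y_t - y_{t-1}
    rows, lhs = [], []
    for t in range(p + 1, len(y)):
        det = [1.0] if constant else []
        lags = [dy[t - 1 - j] for j in range(1, p + 1)]   # Delta y_{t-j}
        rows.append(det + [y[t - 1]] + lags)
        lhs.append(dy[t - 1])
    b, se = ols(rows, lhs)
    i = 1 if constant else 0                              # position of beta'
    n = len(lhs)
    tau = b[i] / se[i]
    z = n * b[i] / (1.0 - sum(b[i + 1:]))                 # equation (14.34)
    return tau, z

rng = random.Random(42)
stationary = [rng.gauss(0.0, 1.0) for _ in range(500)]    # white noise, no unit root
walk, level = [], 0.0
for _ in range(500):
    level += rng.gauss(0.0, 1.0)
    walk.append(level)                                    # random walk, unit root

tau, z = adf_statistics(stationary, p=1)
tau_rw, z_rw = adf_statistics(walk, p=1)
print(tau, z)        # strongly negative: the unit root is rejected
print(tau_rw, z_rw)  # to be compared with Dickey-Fuller critical values
```

Both statistics must be compared with the appropriate Dickey-Fuller critical 
values for the chosen $X_t$, not with critical values from the normal 
distribution. 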


In practice, of course, since n is fixed for any sample, knowing that p should 
increase at a rate slower than $n^{1/3}$ provides no help in choosing p. Moreover, 
investigators do not know what process is actually generating the error terms. 
Thus what is generally done is simply to add as many lags of $\Delta y_t$ as appear 
to be necessary to remove any serial correlation in the residuals. Formal 
procedures for determining just how many lags to add are discussed by Ng 
and Perron (1995, 2001). As we will discuss in the next section, conventional 
methods of inference, such as t and F tests, are asymptotically valid for any 
parameter that can be written as the coefficient of an I(0) variable. Since 
$\Delta y_t$ is I(0) under the null hypothesis, this result applies to regression (14.33), 
and we can use standard methods for determining how many lags to include. 
If too few lags of $\Delta y_t$ are added, the ADF test may tend to overreject the 
null hypothesis when it is true, but adding too many lags tends to reduce the 
power of the test. 


The finite-sample performance of ADF tests is rather mixed. When the serial 
correlation in the error terms is well approximated by a low-order AR(p) 
process without any large, negative roots, ADF tests generally perform quite 
well in samples of moderate size. However, when the error terms seem to 
follow an MA or ARMA process in which the moving average polynomial has 
a large negative root, they tend to overreject severely. See Schwert (1989) 
and Perron and Ng (1996) for evidence on this point. Standard techniques 
for bootstrapping ADF tests do not seem to work particularly well in this 
situation, although they can improve matters somewhat; see Li and Maddala 
(1996). The problem is that it is difficult to generate bootstrap error terms 
with the same time-series properties as the unknown process that actually 
generated the $u_t$. Recent work in this area includes Park (2002) and Chang 
and Park (2003). 


Alternatives to ADF Tests 


Many alternatives to, and variations of, augmented Dickey-Fuller tests have 
been proposed. Among the best known are the tests proposed by Phillips and 
Perron (1988). These Phillips-Perron, or PP, tests have the same asymptotic 
distributions as the corresponding ADF z and $\tau$ tests, but they are computed 
quite differently. The test statistics are based on a regression like (14.31), 
without any modification to allow for serial correlation. A form of HAC 
estimator is then used when computing the test statistics to ensure that serial 
correlation does not affect their asymptotic distributions. Because there is 
now a good deal of evidence that PP tests perform less well in finite samples 
than ADF tests, we will not discuss them further; see Schwert (1989) and 
Perron and Ng (1996), among others, for evidence on this point. 


A procedure that does have some advantages over the standard ADF test is 
the ADF-GLS test proposed by Elliott, Rothenberg, and Stock (1996). The 
idea is to obtain higher power by estimating $\gamma^\circ$ prior to estimating $\beta'$. As can 
readily be seen from Figures 14.2 and 14.3, the more deterministic regressors 
we include in $X_t$, the larger (in absolute value) become the critical values for 
ADF tests based on regression (14.32). Inevitably, this reduces the power of 
the tests. The ADF-GLS test estimates $\gamma^\circ$ by running the regression 


$$y_t - \bar\rho\, y_{t-1} = (X_t - \bar\rho X_{t-1})\gamma^\circ + v_t, \tag{14.35}$$


where $X_t$ contains either a constant or a constant and a trend, and the fixed 
scalar $\bar\rho$ is equal to $1 + \bar c/n$, with $\bar c = -7$ when $X_t$ contains just a constant 
and $\bar c = -13.5$ when it contains both a constant and a trend. Notice that $\bar\rho$ 
tends to unity as $n \to \infty$. Let $\hat\gamma^\circ$ denote the estimate of $\gamma^\circ$ obtained from 
regression (14.35). Then construct the variable $y_t' = y_t - X_t\hat\gamma^\circ$ and run the 
test regression 


$$\Delta y_t' = \beta' y_{t-1}' + \sum_{j=1}^{p} \delta_j \Delta y_{t-j}' + e_t,$$


which looks just like regression (14.32) for the case with no constant term. The 
test statistic is the ordinary t statistic for $\beta' = 0$. When $X_t$ contains only a 
constant term, this test statistic has exactly the same asymptotic distribution 
as $\tau_{nc}$. When $X_t$ contains both a constant and a trend, it has an asymptotic 
distribution that was derived and tabulated by Elliott, Rothenberg, and Stock 
(1996). This distribution, which depends on $\bar c$, is quite close to that of $\tau_c$. 
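

The detrending step based on regression (14.35) might be sketched as follows. 
The function name is ours, and the treatment of the first observation is an 
assumption: pre-sample values of $y_t$ and $X_t$ are set to zero, so that the 
first observation enters the regression untransformed. Since $X_t$ has at 
most two columns here, the normal equations are solved in closed form. 

```python
def gls_detrend(y, trend=False):
    """GLS-detrend y as in regression (14.35); returns (gamma_hat, y_detrended).

    Pre-sample values y_0 and X_0 are set to zero, so the first observation
    enters untransformed (an assumption of this sketch).
    """
    n = len(y)
    c_bar = -13.5 if trend else -7.0
    rho_bar = 1.0 + c_bar / n
    # Deterministic regressors X_t: a constant, plus t = 1..n if trend is set.
    X = [[1.0, float(t)] if trend else [1.0] for t in range(1, n + 1)]
    k = len(X[0])
    # Quasi-differenced data: y_t - rho_bar*y_{t-1} and X_t - rho_bar*X_{t-1}.
    yq = [y[0]] + [y[t] - rho_bar * y[t - 1] for t in range(1, n)]
    Xq = [X[0]] + [[X[t][i] - rho_bar * X[t - 1][i] for i in range(k)]
                   for t in range(1, n)]
    # Normal equations (at most 2 x 2), solved in closed form.
    sxx = [[sum(r[i] * r[j] for r in Xq) for j in range(k)] for i in range(k)]
    sxy = [sum(r[i] * v for r, v in zip(Xq, yq)) for i in range(k)]
    if k == 1:
        gamma = [sxy[0] / sxx[0][0]]
    else:
        det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
        gamma = [(sxy[0] * sxx[1][1] - sxx[0][1] * sxy[1]) / det,
                 (sxx[0][0] * sxy[1] - sxx[1][0] * sxy[0]) / det]
    # The detrended series uses the original, untransformed regressors X_t.
    detrended = [y[t] - sum(X[t][i] * gamma[i] for i in range(k))
                 for t in range(n)]
    return gamma, detrended

y = [3.0 + 0.5 * t for t in range(1, 51)]   # exact linear trend, no noise
gamma_hat, y_detrended = gls_detrend(y, trend=True)
print(gamma_hat)   # close to [3.0, 0.5] because the trend is fitted exactly
```

The detrended series would then be used in the test regression without any 
deterministic regressors. 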


There is a massive literature on unit root tests, most of which we will not 
attempt to discuss. Hayashi (2000) and Bierens (2001) provide recent 
treatments that are more detailed than ours. 


14.5 Cointegration 


Economic theory often suggests that two or more economic variables should be 
linked more or less closely. Examples include interest rates on assets of differ- 
ent maturities, prices of similar commodities in different countries, disposable 
income and consumption, government spending and tax revenues, wages and 
prices, and the money supply and the price level. Although deterministic rela- 
tionships among the variables in any one of these sets are usually assumed to 
hold only in the long run, economic forces are expected to act in the direction 
of eliminating short-run deviations from these long-term relationships. 


A great many economic variables are, or at least appear to be, I(1). As we saw 
in Section 14.2, random variables which are I(1) tend to diverge as $n \to \infty$, 
because their unconditional variances are proportional to n. Thus it might 
seem that two or more such variables could never be expected to obey any sort 
of long-run relationship. But, as we will see, variables that are all individually 
I(1), and hence divergent, can in a certain sense diverge together. Formally, it 
is possible for some linear combinations of a set of I(1) variables to be I(0). If 
that is the case, the variables are said to be cointegrated. When variables are 
cointegrated, they satisfy one or more long-run relationships, although they 
may diverge substantially from these relationships in the short run. 


VAR Models with Unit Roots 


In Chapter 13, we saw that a convenient way to model several time series 
simultaneously is to use a vector autoregression, or VAR model, of the type 
introduced in Section 13.7. Just as with univariate AR models, a VAR model 
can have unit roots and so give rise to nonstationary series. We begin by 
considering the simplest case, namely, a VAR(1) model with just two variables. 
We assume, at least for the present, that there are neither constants nor 
trends. Therefore, we can write the model as 


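$$\begin{aligned}
y_{t1} &= \phi_{11}y_{t-1,1} + \phi_{12}y_{t-1,2} + u_{t1}, \\
y_{t2} &= \phi_{21}y_{t-1,1} + \phi_{22}y_{t-1,2} + u_{t2},
\end{aligned}
\qquad
\begin{bmatrix} u_{t1} \\ u_{t2} \end{bmatrix} \sim \mathrm{IID}(0, \Omega). \tag{14.36}$$


Let $z_t$ and $u_t$ be 2-vectors, the former with elements $y_{t1}$ and $y_{t2}$ and the latter 
with elements $u_{t1}$ and $u_{t2}$, and let $\Phi$ be the $2 \times 2$ matrix with $ij$th element 
$\phi_{ij}$. Then equations (14.36) can be written as 


$$z_t = \Phi z_{t-1} + u_t, \qquad u_t \sim \mathrm{IID}(0, \Omega). \tag{14.37}$$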


In order to keep the analysis as simple as possible, we assume that $z_0 = 0$. 
This implies that the solution to the recursion (14.37) is 


$$z_t = \sum_{s=1}^{t} \Phi^{t-s} u_s. \tag{14.38}$$


A univariate AR model has a unit root if the coefficient on the lagged dependent 
variable is equal to unity. Analogously, as we now show, the VAR model 
(14.36) has a unit root if an eigenvalue of the matrix $\Phi$ is equal to 1. 


Recall from Section 12.8 that the matrix $\Phi$ has an eigenvalue $\lambda$ and corresponding 
eigenvector $x$ if $\Phi x = \lambda x$. For a $2 \times 2$ matrix, there are two 
eigenvalues, $\lambda_1$ and $\lambda_2$. If $\lambda_1 \ne \lambda_2$, there are two corresponding eigenvectors, 
$x_1$ and $x_2$, which are linearly independent; see Exercise 14.17. If $\lambda_1 = \lambda_2$, we 
assume, with only a slight loss of generality, that there still exist two linearly 
independent eigenvectors $x_1$ and $x_2$. Then, as in equation (12.116), we can 
write 


$$\Phi X = X\Lambda, \quad\text{with}\quad X \equiv [x_1 \;\; x_2] \quad\text{and}\quad \Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}.$$


It follows that $\Phi^2 X = \Phi(\Phi X) = \Phi X\Lambda = X\Lambda^2$. Performing this operation 
repeatedly shows that, for any positive integer s, $\Phi^s X = X\Lambda^s$. 
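

As an illustration, the eigenvalues and eigenvectors of a $2 \times 2$ matrix can 
be obtained from the characteristic equation, and the relation 
$\Phi^s x = \lambda^s x$ can be checked directly. The sketch below assumes real 
eigenvalues and a matrix that is not diagonal; the function names and the 
matrix `phi` are hypothetical choices of ours, picked so that one eigenvalue 
equals 1. 

```python
import math

def eig2x2(phi):
    """Real eigenvalues and eigenvectors of a 2 x 2 matrix (list of two rows).

    Assumes the eigenvalues are real and the matrix is not diagonal,
    which holds for the illustrative matrix below.
    """
    (a, b), (c, d) = phi
    tr, det = a + d, a * d - b * c
    root = math.sqrt(tr * tr - 4.0 * det)
    lams = [(tr + root) / 2.0, (tr - root) / 2.0]
    # For each eigenvalue, (b, lam - a) solves (Phi - lam*I)x = 0 when b != 0;
    # otherwise (lam - d, c) does.
    vecs = [(b, lam - a) if b != 0 else (lam - d, c) for lam in lams]
    return lams, vecs

def matvec(m, x):
    """Product of a 2 x 2 matrix and a 2-vector."""
    return (m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1])

phi = [[0.8, 0.2], [0.2, 0.8]]        # hypothetical coefficient matrix
lams, vecs = eig2x2(phi)
print(lams)                            # approximately [1.0, 0.6]: one unit root

# Check Phi^s x = lam^s x for s = 3 by applying Phi three times.
for lam, x in zip(lams, vecs):
    v = x
    for _ in range(3):
        v = matvec(phi, v)
    print(v, tuple(lam ** 3 * xi for xi in x))
```

With one eigenvalue equal to 1 and the other less than 1 in absolute value, 
this matrix corresponds to the cointegrated case discussed in what follows. 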


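The solution (14.38) can be rewritten in terms of the eigenvalues and eigenvectors 
of $\Phi$ as follows: 


$$z_t = \sum_{s=1}^{t} X\Lambda^{t-s}X^{-1}u_s. \tag{14.39}$$


The inverse matrix $X^{-1}$ exists because $x_1$ and $x_2$ are linearly independent. 
It is then not hard to show that the solution (14.39) can be written as 


$$\begin{aligned}
y_{t1} &= x_{11}\sum_{s=1}^{t}\lambda_1^{t-s}e_{s1} + x_{12}\sum_{s=1}^{t}\lambda_2^{t-s}e_{s2}, \\
y_{t2} &= x_{21}\sum_{s=1}^{t}\lambda_1^{t-s}e_{s1} + x_{22}\sum_{s=1}^{t}\lambda_2^{t-s}e_{s2},
\end{aligned} \tag{14.40}$$


where $e_t \equiv X^{-1}u_t = [e_{t1} \;\vdots\; e_{t2}] \sim \mathrm{IID}(0, \Sigma)$, 
$\Sigma \equiv X^{-1}\Omega(X^\top)^{-1}$, and $x_{ij}$ is the $ij$th element of $X$. 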


It can be seen from equations (14.40) that the series $y_{t1}$ and $y_{t2}$ are both 
linear combinations of the two series 


$$v_{t1} \equiv \sum_{s=1}^{t}\lambda_1^{t-s}e_{s1} \quad\text{and}\quad v_{t2} \equiv \sum_{s=1}^{t}\lambda_2^{t-s}e_{s2}. \tag{14.41}$$


If both eigenvalues are less than 1 in absolute value, then $v_{t1}$ and $v_{t2}$ are I(0). 
If both eigenvalues are equal to 1, then the two series are random walks, and 
consequently $y_{t1}$ and $y_{t2}$ are I(1). If one eigenvalue, say $\lambda_1$, is equal to 1 
while the other is less than 1 in absolute value, then $v_{t1}$ is a random walk, 
and $v_{t2}$ is I(0). In general, then, both $y_{t1}$ and $y_{t2}$ are I(1), although there 
exists a linear combination of them, namely $v_{t2}$, that is I(0). According to 
the definition we gave above, $y_{t1}$ and $y_{t2}$ are cointegrated in this case. Each 
differs from a multiple of the random walk $v_{t1}$ by a process that, being I(0), 
does not diverge and has a finite variance as $t \to \infty$. 


Quite generally, if the series $y_{t1}$ and $y_{t2}$ are cointegrated, then there exists a 
2-vector $\eta$ with elements $\eta_1$ and $\eta_2$ such that 


$$\nu_t \equiv \eta^\top z_t = \eta_1 y_{t1} + \eta_2 y_{t2} \tag{14.42}$$


is I(0). The vector $\eta$ is called a cointegrating vector. It is clearly not unique, 
since it could be multiplied by any nonzero scalar without affecting anything 
except the sign and the scale of $\nu_t$. 


Equation (14.42) is an example of a cointegrating regression. This particular 
one is unnecessarily restrictive. In practice, we might expect the relationship 
between $y_{t1}$ and $y_{t2}$ to change gradually over time. We can allow for this by 
adding a constant term and, perhaps, one or more trend terms, so as to obtain 


$$\eta^\top z_t = X_t\gamma + \nu_t, \tag{14.43}$$


where $X_t$ denotes a deterministic row vector that may or may not have any 
elements. If it does, the first element is a constant, the second, if it exists, 
is normally a linear time trend, the third, if it exists, is normally a quadratic 
time trend, and so on. There could also be seasonal dummy variables in $X_t$. 
Since $z_t$ could contain more than two variables, equation (14.43) is actually 
a very general way of writing a cointegrating regression. The error term 
$\nu_t = \eta^\top z_t - X_t\gamma$ that is implicitly defined in equation (14.43) is called the 
equilibrium error. 


Unless each of a set of cointegrated variables is I(1), the cointegrating vec- 
tor is trivial, since it has only one nonzero element, namely, the one that 
corresponds to the I(0) variable. Therefore, before estimating equations like 
(14.42) and (14.43), it is customary to test the null hypothesis that each of 
the series in $z_t$ has a unit root. If this hypothesis is rejected for any of the 
series, it is pointless to retain it in the set of possibly cointegrated variables. 


When there are more than two variables involved, there may be more than 
one cointegrating vector. For the remainder of this section, however, we will 
focus on the case in which there is just one such vector. The more general 
case, in which there are g variables and up to $g - 1$ cointegrating vectors, will 
be discussed in the next section. 


It is not entirely clear how to specify the deterministic vector $X_t$ in a cointegrating 
regression like (14.43). Ordinary t and F tests are not valid, partly 
because the stochastic regressors are not I(0) and any trending regressors do 
not satisfy the usual conditions for the matrix $n^{-1}X^\top X$ to tend to a positive 
definite matrix as $n \to \infty$, and partly because the error terms are likely to 
display serial correlation. As with unit root tests, investigators commonly use 
several choices for $X_t$ and present several sets of results. 


Estimating Cointegrating Vectors 


If we have a set of I(1) variables that may be cointegrated, we usually wish 
to estimate the parameters of the cointegrating vector $\eta$. Logic dictates that, 
before doing so, we should perform one or more tests to see if the data seem 
compatible with the existence of a cointegrating vector, but it is easier to 
discuss estimation before testing. Testing is the topic of the next section. 


The simplest way to estimate a cointegrating vector is just to pick one of the 
I(1) variables and regress it on $X_t$ and the other I(1) variables by OLS. Let 
$Y_t \equiv [y_t \;\; Y_{t2}]$ be a $1 \times g$ row vector containing all the I(1) variables, $y_t$ being 
the one selected as regressand. The OLS regression can then be written as 


$$y_t = X_t\gamma + Y_{t2}\eta_2 + \nu_t, \tag{14.44}$$


where $\eta = [1 \;\vdots\; -\eta_2]$. The nonuniqueness of $\eta$ is resolved here by setting the 
first element to 1. The OLS estimator $\hat\eta_2$ is known as the levels estimator. 


At first sight, this approach seems to ignore all the precepts of good econometric 
practice. If the $y_{ti}$ are generated by a DGP belonging to a VAR model, 
such as (14.36) in the case of two variables, then they are all endogenous. 
Therefore, unless the error terms in every equation of the VAR model happen 
to be uncorrelated with those in every other equation, the regressors $Y_{t2}$ 
in equation (14.44) will be correlated with the error term $\nu_t$. In addition, 
this error term will often be serially correlated. As we will see below, for 
the model (14.36), $\nu_t$ depends on the serially correlated series $v_{t2}$ defined 
in the second of equations (14.41). Nevertheless, the levels estimator of the 
vector $\eta_2$ is not only consistent but super-consistent, in a sense to be made 
explicit shortly. This result indicates just how different asymptotic theory is 
when I(1) variables are involved. 


Let us suppose that we have two cointegrated series, $y_{t1}$ and $y_{t2}$, generated 
by equations (14.40), with $\lambda_1 = 1$ and $|\lambda_2| < 1$. By use of (14.41), we have 


$$y_{t1} = x_{11}v_{t1} + x_{12}v_{t2}, \quad\text{and}\quad y_{t2} = x_{21}v_{t1} + x_{22}v_{t2}, \tag{14.45}$$


where $v_{t1}$ is a random walk, and $v_{t2}$ is I(0). For simplicity, suppose that $X_t$ 
is empty in regression (14.44), $y_t = y_{t1}$, and $Y_{t2}$ has the single element $y_{t2}$. 
Then we have 


$$\hat\eta_2 = \frac{\sum_{t=1}^{n} y_{t2}\,y_{t1}}{\sum_{t=1}^{n} y_{t2}^2}, \tag{14.46}$$


where $\hat\eta_2$ is the OLS estimator of the single element of $\eta_2$. 


It follows from equations (14.45) that the denominator of the right-hand side 
of equation (14.46) is 


$$x_{21}^2\sum_{t=1}^{n}v_{t1}^2 + 2x_{21}x_{22}\sum_{t=1}^{n}v_{t1}v_{t2} + x_{22}^2\sum_{t=1}^{n}v_{t2}^2. \tag{14.47}$$


Since $\mathrm{Var}(e_{t1}) = \sigma_{11}$, the element in the first row and column of the covariance 
matrix $\Sigma$ of the innovations $e_{t1}$ and $e_{t2}$, we see that the random walk $v_{t1}$ can 
be expressed as $\sigma_{11}^{1/2}w_t$, for a standardized random walk $w_t$. We saw from 
the argument following expression (14.10) that $\sum_{t=1}^{n}w_t^2 = O(n^2)$ as $n \to \infty$, 
and so the first term of (14.47) is $O(n^2)$. The series $v_{t2}$ has a stationary 
variance; in fact $E(v_{t2}^2)$ tends to $\sigma_{22}/(1 - \lambda_2^2)$ as $t \to \infty$. By the law of 
large numbers, therefore, the last term of (14.47), divided by n, tends to $x_{22}^2$ 
times this stationary variance as $n \to \infty$. The term itself is thus O(n). By an argument 
similar to the one we used to show that the expression (14.24) is O(n), we 
can show that the middle term in (14.47) is O(n); see Exercise 14.18. 


In like manner, we see that the numerator of the right-hand side of (14.46) is 


$$x_{11}x_{21}\sum_{t=1}^{n}v_{t1}^2 + (x_{11}x_{22} + x_{12}x_{21})\sum_{t=1}^{n}v_{t1}v_{t2} + x_{12}x_{22}\sum_{t=1}^{n}v_{t2}^2. \tag{14.48}$$


The first term here is $O(n^2)$, and the other two are O(n). Thus, if we divide 
both numerator and denominator in (14.46) by $n^2$, only the first terms in 
expressions (14.47) and (14.48) contribute nonzero limits as $n \to \infty$. The 
factors $x_{21}\sum v_{t1}^2$ cancel, and the limit of $\hat\eta_2$ is therefore seen to be $x_{11}/x_{21}$. 
From equations (14.45), we see that 


$$y_{t1} - \frac{x_{11}}{x_{21}}\,y_{t2} = \frac{x_{12}x_{21} - x_{11}x_{22}}{x_{21}}\,v_{t2},$$


from which, given that $v_{t2}$ is stationary, we conclude that $[1 \;\vdots\; -x_{11}/x_{21}]$ is indeed 
the cointegrating vector. It follows that $\hat\eta_2$ is consistent for $\eta_2 \equiv x_{11}/x_{21}$. 
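

A small simulation can make this argument concrete. The sketch below generates 
two series from (14.45) with $\lambda_1 = 1$ and computes the levels estimator 
(14.46). Everything specific in it is an assumption made for illustration: the 
function names, the matrix `X`, the value of `lam2`, and the choice of 
independent standard normal innovations. 

```python
import random

def simulate_cointegrated(n, X, lam2, seed=0):
    """Generate y_t1, y_t2 from (14.45) with lambda_1 = 1 and |lam2| < 1.

    The innovations e_t1 and e_t2 are independent N(0, 1) draws, an
    assumption made only for this illustration.
    """
    rng = random.Random(seed)
    v1 = v2 = 0.0
    y1, y2, v2_path = [], [], []
    for _ in range(n):
        v1 = v1 + rng.gauss(0.0, 1.0)          # random walk v_t1
        v2 = lam2 * v2 + rng.gauss(0.0, 1.0)   # stationary AR(1) v_t2
        y1.append(X[0][0] * v1 + X[0][1] * v2)
        y2.append(X[1][0] * v1 + X[1][1] * v2)
        v2_path.append(v2)
    return y1, y2, v2_path

def levels_estimator(y1, y2):
    """OLS slope from regressing y_t1 on y_t2 with no constant, as in (14.46)."""
    return sum(a * b for a, b in zip(y2, y1)) / sum(b * b for b in y2)

X = [[2.0, 1.0], [1.0, 1.0]]      # hypothetical eigenvector matrix
eta2 = X[0][0] / X[1][0]          # true value x11 / x21 = 2
for n in (100, 1000, 10000):
    y1, y2, _ = simulate_cointegrated(n, X, lam2=0.5, seed=1)
    print(n, levels_estimator(y1, y2) - eta2)   # estimation error for each n
```

The printed estimation errors depend on the seed, so no particular values 
should be expected, but over many replications they shrink at a rate 
proportional to $1/n$ rather than $n^{-1/2}$. 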


If we divide expression (14.47) by $x_{21}\sum v_{t1}^2$, which is $O(n^2)$, we obtain the 
result $x_{21} + O(n^{-1})$, since the last two terms of (14.47) are O(n). Similarly, 
dividing expression (14.48) by the same quantity gives $x_{11} + O(n^{-1})$. It follows 
that $\hat\eta_2 - \eta_2 = O(n^{-1})$. This is the property of super-consistency mentioned 
above. It implies that the estimation error $\hat\eta_2 - \eta_2$ tends to zero like $n^{-1}$ as 
$n \to \infty$. We may say that $\hat\eta_2$ is n-consistent, unlike the root-n consistent 
estimators of conventional asymptotic theory. Note, however, that instincts 
based on conventional theory are correct to the extent that $\hat\eta_2$ is biased in 
finite samples. This fact can be worrisome in practice, and it is therefore 
often desirable to find alternative ways of estimating cointegrating vectors. 


With a little more work, it can be seen that the super-consistency result 
applies more generally to cointegrating regressions like (14.43), with deterministic 
regressors such as a constant and a trend, when one element of $Y_t$ is 
arbitrarily given a coefficient of unity and the others moved to the right-hand 
side. For a rigorous discussion of this result, see Stock (1987). Note also that 
we do not as yet have the means to perform statistical inference on cointegrating 
vectors, since we have not studied the asymptotic distribution of the 
order-unity quantity $n(\hat\eta_2 - \eta_2)$, which turns out to be nonstandard. We will 
discuss this point further later in this section. 


Estimation Using an ECM 


We mentioned in Section 13.4 that an error correction model can be used 
even when the data are nonstationary. In order to justify this assertion, we 
start again from the simplest case, in which the two series $y_{t1}$ and $y_{t2}$ are 
generated by the two equations (14.45). From the definition (14.41) of the 
I(0) process $v_{t2}$, we have 


$$\Delta v_{t2} = (\lambda_2 - 1)v_{t-1,2} + e_{t2}. \tag{14.49}$$


We may invert equations (14.45) as follows: 


$$v_{t1} = x^{11}y_{t1} + x^{12}y_{t2}, \quad\text{and}\quad v_{t2} = x^{21}y_{t1} + x^{22}y_{t2}, \tag{14.50}$$


where $x^{ij}$ is the $ij$th element of the inverse $X^{-1}$ of the matrix with typical 
element $x_{ij}$. If we use the expression for $v_{t2}$ and its first difference given by 
equations (14.50), then equation (14.49) becomes 


$$x^{21}\Delta y_{t1} = -x^{22}\Delta y_{t2} + (\lambda_2 - 1)(x^{21}y_{t-1,1} + x^{22}y_{t-1,2}) + e_{t2}.$$


Dividing by $x^{21}$ and noting that the relation between the inverse matrices 
implies that $x^{21}x_{11} + x^{22}x_{21} = 0$, we obtain the error-correction model 


$$\Delta y_{t1} = \eta_2\Delta y_{t2} + (\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2}) + e_{t2}', \tag{14.51}$$


where, as above, $\eta_2 = x_{11}/x_{21}$ is the second component of the cointegrating 
vector, and $e_{t2}' \equiv e_{t2}/x^{21}$. Although the notation is somewhat different from 
that used in Section 13.3, it is easy enough to see that equation (14.51) is 
a special case of an ECM like (13.62). Notice that it must be estimated by 
nonlinear least squares. 


In general, equation (14.51) is an unbalanced regression, because it mixes the 
first differences, which are I(0), with the levels, which are I(1). But the linear 
combination $y_{t-1,1} - \eta_2 y_{t-1,2}$ is I(0), on account of the cointegration of $y_{t1}$ 
and $y_{t2}$. The term $(\lambda_2 - 1)(y_{t-1,1} - \eta_2 y_{t-1,2})$ is precisely the error-correction 
term of this ECM. Indeed, $y_{t-1,1} - \eta_2 y_{t-1,2}$ is the equilibrium error, and it 
influences $\Delta y_{t1}$ through the negative coefficient $\lambda_2 - 1$. 


The parameter $\eta_2$ appears twice in (14.51), once in the equilibrium error, 
and once as the coefficient of $\Delta y_{t2}$. The implied restriction is a consequence 
of the very special structure of the DGP (14.45). It is the parameter that 
appears in the equilibrium error that defines the cointegrating vector, not the 
coefficient of $\Delta y_{t2}$. This follows because it is the equilibrium error that defines 
the long-run relationship linking $y_{t1}$ and $y_{t2}$, whereas the coefficient of $\Delta y_{t2}$ is 
a short-run multiplier, determining the immediate impact of a change in $y_{t2}$ 
on $y_{t1}$. It is usually thought to be too restrictive to require that the long-run 
and short-run multipliers should be the same, and so, for the purposes of 
estimation and testing, equation (14.51) is normally replaced by 


$$\Delta y_{t1} = \alpha\Delta y_{t2} + \delta_1 y_{t-1,1} + \delta_2 y_{t-1,2} + e_t, \tag{14.52}$$


where the new parameter $\alpha$ is the short-run multiplier, $\delta_1 = \lambda_2 - 1$, and 
$\delta_2 = (1 - \lambda_2)\eta_2$. Since (14.52) is just a linear regression, the parameter 
of interest, which is $\eta_2$, can be estimated by $\hat\eta_2 \equiv -\hat\delta_2/\hat\delta_1$, using the OLS 
estimates of $\delta_1$ and $\delta_2$. 


Equation (14.52) is without doubt an unbalanced regression, and so we must 
expect that the OLS estimates will not have their usual distributions. It 
turns out that $\hat\eta_2$ is a super-consistent estimator of $\eta_2$. In fact, it is usually 
less biased than the estimate obtained from the simple regression of $y_{t1}$ on $y_{t2}$, 
as readers are invited to check by simulation in Exercise 14.20. 
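

The ECM estimator based on regression (14.52) can be sketched as follows. As 
before, this is an illustration under assumptions of our own: the function 
names, the matrix `X`, the value of `lam2`, and independent standard normal 
innovations are all hypothetical, and the regression is solved from the 
normal equations using only the standard library. 

```python
import random

def ols_coefficients(X, y):
    """OLS coefficients from the normal equations, by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def simulate_cointegrated(n, X, lam2, seed=0):
    """Generate y_t1, y_t2 from (14.45) with lambda_1 = 1 and |lam2| < 1."""
    rng = random.Random(seed)
    v1 = v2 = 0.0
    y1, y2 = [], []
    for _ in range(n):
        v1 = v1 + rng.gauss(0.0, 1.0)          # random walk v_t1
        v2 = lam2 * v2 + rng.gauss(0.0, 1.0)   # stationary AR(1) v_t2
        y1.append(X[0][0] * v1 + X[0][1] * v2)
        y2.append(X[1][0] * v1 + X[1][1] * v2)
    return y1, y2

def ecm_estimate(y1, y2):
    """Estimate eta_2 as -delta2_hat/delta1_hat from regression (14.52)."""
    rows, lhs = [], []
    for t in range(1, len(y1)):
        # Regressors: Delta y_t2, y_{t-1,1}, y_{t-1,2}
        rows.append([y2[t] - y2[t - 1], y1[t - 1], y2[t - 1]])
        lhs.append(y1[t] - y1[t - 1])
    alpha, delta1, delta2 = ols_coefficients(rows, lhs)
    return -delta2 / delta1

X = [[2.0, 1.0], [1.0, 1.0]]      # hypothetical eigenvector matrix, eta_2 = 2
y1, y2 = simulate_cointegrated(5000, X, lam2=0.5, seed=1)
eta2_ecm = ecm_estimate(y1, y2)
eta2_levels = sum(a * b for a, b in zip(y2, y1)) / sum(b * b for b in y2)
print(eta2_ecm, eta2_levels)      # both should be close to 2 in a sample this large
```

A single sample cannot show which estimator is less biased; for that, the 
two functions would have to be called over many replications, in the spirit 
of the simulation exercise mentioned above. 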


In the general case, with k cointegrated variables, we may estimate the cointegrating 
vector using the linear regression 


$$\Delta y_t = X_t\gamma + \Delta Y_{t2}\alpha + \delta y_{t-1} + Y_{t-1,2}\delta_2 + e_t, \tag{14.53}$$


where, as before, $X_t$ is a vector of deterministic regressors, $\gamma$ is the associated 
parameter vector, $Y_t = [y_t \;\; Y_{t2}]$ is a $1 \times k$ vector, $\delta$ is a scalar, and $\alpha$ and 
$\delta_2$ are both $(k-1)$-vectors. Regression (14.52) is evidently a special case 
of regression (14.53). The super-consistent ECM estimator of $\eta_2$ is then 
$\hat\eta_2 = -\hat\delta_2/\hat\delta$, that is, minus the ratio of the OLS estimator $\hat\delta_2$ to the 
OLS estimator $\hat\delta$, just as in the special case (14.52). 


Other Approaches 


When we cannot, or do not want to, specify an ECM, at least two other 
methods are available for estimating a cointegrating vector. One, proposed 
by Phillips and Hansen (1990), is called fully modified estimation. The idea 
is to modify the OLS estimate of $\eta_2$ in equation (14.44) by subtracting an 
estimate of the bias. The result turns out to be asymptotically multivariate 
normal, and it is possible to estimate its asymptotic covariance matrix. To 
explain just how fully modified estimation works would require more space 
than we have available. Interested readers should consult the original paper 
or Banerjee, Dolado, Galbraith, and Hendry (1993, Chapter 7). 


A second approach, which is due to Saikkonen (1991), is much simpler to 
describe and implement. We run the regression 


$$y_t = X_t\gamma + Y_{t2}\eta_2 + \sum_{j=-p}^{p} \Delta Y_{t+j,2}\,\delta_j + \nu_t \tag{14.54}$$


by OLS. Observe that regression (14.54) is just regression (14.44) with the 
addition of p leads and p lags of the first differences of $Y_{t2}$. As with augmented 
Dickey-Fuller tests, the idea is to add enough leads and lags so that the error 
terms appear to be serially independent. Provided that p is allowed to increase 
at the appropriate rate as $n \to \infty$, this regression yields estimates that are 
asymptotically efficient. 
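

The only step in this procedure that requires some care is the construction 
of the leads and lags, since observations are lost at both ends of the sample. 
The sketch below builds the regressand and regressors of (14.54) for the 
simplest case, which we assume purely for illustration: a single I(1) 
regressor and a constant as the only deterministic term. The function name 
and the toy data are ours. 

```python
def leads_lags_rows(y, y2, p):
    """Build (regressand, regressor rows) for regression (14.54).

    Sketch for a single I(1) regressor y2 and a constant as the only
    deterministic term (both assumptions of this example). Each row is
    [1, y2_t, dy2_{t-p}, ..., dy2_{t+p}].
    """
    n = len(y)
    # dy2[t] = y2_t - y2_{t-1}; the first difference is undefined at t = 0.
    dy2 = [None] + [y2[t] - y2[t - 1] for t in range(1, n)]
    lhs, rows = [], []
    # Observation t needs dy2[t-p] (so t - p >= 1) and dy2[t+p] (so t + p <= n - 1).
    for t in range(p + 1, n - p):
        window = [dy2[t + j] for j in range(-p, p + 1)]
        rows.append([1.0, y2[t]] + window)
        lhs.append(y[t])
    return lhs, rows

y2 = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0]   # toy data
y = [2.0 * v for v in y2]
lhs, rows = leads_lags_rows(y, y2, p=1)
print(len(rows))   # 4 usable observations out of 7
print(rows[0])     # [1.0, 3.0, 1.0, 2.0, 3.0]
```

The resulting rows can be passed to any OLS routine; the coefficient on the 
second column is then the estimate of $\eta_2$. 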


Inference in Regressions with I(1) Variables 


From what we have said so far, it might seem that standard asymptotic results 
never apply when a regression contains one or more regressors that are I(1). 
This is true for spurious regressions like (14.12), for unit root test regressions 
like (14.18), and for error-correction models like (14.52). In all these cases, 
certain statistics that are computed as ordinary t statistics actually follow 
nonstandard distributions asymptotically. 


However, it is not true that the t statistic on every parameter in a regression 
that involves I(1) variables follows a nonstandard distribution asymptotically. 
It is not even true that the t statistic on every coefficient of an I(1) 
variable follows such a distribution. Instead, as Sims, Stock, and Watson 
(1990) showed in a famous paper, the t statistic on any parameter that ap- 
pears only as the coefficient of an I(0) variable, perhaps after the regressors 
are rearranged, follows the standard normal distribution asymptotically. Sim- 
ilarly, an F statistic for a test of the hypothesis that any set of parameters 
is zero follows its usual asymptotic distribution if all the parameters can be 
written as coefficients of I(0) variables at the same time. On the other hand, 
t statistics and F statistics corresponding to parameters that do not satisfy 
this condition generally follow nonstandard limiting distributions, although 
there are certain exceptions that we will not discuss here; see West (1988) 
and Sims, Stock, and Watson (1990). 


We will not attempt to prove these results, which are by no means trivial. 
Proofs may be found in the original paper by Sims et al., and there is a some- 
what simpler discussion in Banerjee, Dolado, Galbraith, and Hendry (1993, 
Chapter 6). Instead, we will consider two examples that should serve to illustrate 
the nature of the results. First, consider a simple ECM reparametrized 
as equation (14.52). When $y_{t1}$ and $y_{t2}$ are not cointegrated, it is impossible 
to arrange things so that $\delta_1$ is the coefficient of an I(0) variable. Therefore, 
the t statistic for $\delta_1 = 0$ follows a nonstandard distribution asymptotically. 
However, when $y_{t1}$ and $y_{t2}$ are cointegrated, the quantity $y_{t-1,1} - \eta_2 y_{t-1,2}$ is 
I(0). In this case, therefore, $\delta_1$ is the coefficient of an I(0) variable, and the 
t statistic for $\delta_1 = \delta_1^0$ is asymptotically distributed as N(0, 1), if the true value 
of $\delta_1$ is the negative number $\delta_1^0$. 


We can rewrite equation (14.52) as 


$$\Delta y_{t1} = \alpha\Delta y_{t2} - \delta_2(\eta_1 y_{t-1,1} - y_{t-1,2}) + e_t, \tag{14.55}$$


where $\eta_1 = 1/\eta_2 = -\delta_1/\delta_2$. In equation (14.55), $\delta_2$ is written as the coefficient 
of a variable that is I(0) if $y_{t1}$ and $y_{t2}$ are cointegrated. It follows that the 
t statistic for a test that $\delta_2$ is equal to its true (presumably positive) value is 
asymptotically distributed as N(0, 1). 


We have just seen that, when $y_{t1}$ and $y_{t2}$ are cointegrated, equation (14.52) 
can be rewritten in such a way that either $\delta_1$ or $\delta_2$ is the coefficient of an 
I(0) variable. Consequently, the t statistic on every coefficient in (14.52) is 
asymptotically normally distributed. Despite this, it is not the case that an 
F statistic for a test concerning both $\delta_1$ and $\delta_2$ follows its usual asymptotic 
distribution under the null hypothesis. This is because we cannot rewrite 
(14.52) so that both $\delta_1$ and $\delta_2$ are coefficients of I(0) variables at the same 
time. Indeed, if $\hat\delta_1$ and $\hat\delta_2$ were jointly asymptotically normal, the ratio $\hat\eta_2$ 
would also be asymptotically normal, with the same rate of convergence, in 
contradiction of the result that $\hat\eta_2$ is super-consistent. 


It is not obvious how it is possible for both $\hat\delta_1$ and $\hat\delta_2$ to be asymptotically 
normal, with the usual root-n rate of convergence, while the ratio $\hat\eta_2$ is 
super-consistent. The phenomenon is explained by the fact, which we will 
not attempt to demonstrate in detail here, that the two random variables 
$n^{1/2}(\hat\delta_1 - \delta_1)/\delta_1$ and $n^{1/2}(\hat\delta_2 - \delta_2)/\delta_2$ tend as $n \to \infty$ to exactly the same 
random variable, and so differ only at order $n^{-1/2}$. The two variables are 
therefore perfectly correlated asymptotically. It is straightforward (see 
Exercise 14.21) to show that this implies that 


$$\frac{\hat\delta_2}{\hat\delta_1} = \frac{\delta_2}{\delta_1} + O(n^{-1}). \tag{14.56}$$


This result expresses the super-consistency of $\hat\eta_2$. 


As a second example, consider the augmented Dickey-Fuller test regression 


$$\Delta y_t = \gamma + \beta' y_{t-1} + \delta_1\Delta y_{t-1} + e_t, \tag{14.57}$$


which is a special case of equation (14.32). This can be rewritten as 


$$\begin{aligned}
\Delta y_t &= \gamma + \beta' y_{t-1} + \delta_1 y_{t-1} - \delta_1 y_{t-2} + e_t \\
&= \gamma + \beta'(y_{t-1} - y_{t-2}) + \delta_1 y_{t-1} + (\beta' - \delta_1)y_{t-2} + e_t.
\end{aligned} \tag{14.58}$$


When $y_t$ is I(1), we cannot write this regression in such a way that $\beta'$ is the 
coefficient of an I(0) variable. In the second line of (14.58), it does multiply 
such a variable, since $y_{t-1} - y_{t-2}$ is I(0), but it also multiplies $y_{t-2}$, which is 
I(1). Thus we may expect that the t statistic for $\beta' = 0$ has a nonstandard 
asymptotic distribution. As we saw in Section 14.3, that is indeed the case, 
since it follows the Dickey-Fuller $\tau_c$ distribution graphed in Figure 14.2. 


On the other hand, because $\Delta y_{t-1}$ is I(0), the t statistic for $\delta_1 = 0$ in equation 
(14.57) does follow the standard normal distribution asymptotically. Moreover, 
F tests for the coefficients of more than one lag of $\Delta y_t$ to be jointly zero 
also yield statistics that follow the usual asymptotic F distribution. That 
is why we can validly use standard tests to decide how many lags of $\Delta y_t$ 
to include in the test regression (14.33) that is used to perform augmented 
Dickey-Fuller tests. 


Estimation by a Vector Autoregression 


The procedures we have discussed so far for estimating and making inferences 
about cointegrating vectors are all in essence single-equation methods. A very 
popular alternative to those methods is to estimate a vector autoregression, 
or VAR, for all of the possibly cointegrated variables. The best-known such 
methods were introduced by Johansen (1988, 1991) and initially applied by 
Johansen and Juselius (1990, 1992), and a similar approach was introduced 
independently by Ahn and Reinsel (1988, 1990). Johansen (1995) provides a 
detailed exposition. An advantage of these methods is that they can allow for 
more than one cointegrating relation among a set of more than two variables. 
Consider the VAR

$$Y_t = X_t B + \sum_{i=1}^{p+1} Y_{t-i}\Phi_i + U_t, \tag{14.59}$$

where $Y_t$ is a $1 \times g$ vector of observations on the levels of a set of variables,
each of which is assumed to be I(1), $X_t$ (which may or may not be present) is
a row vector of deterministic variables, such as a constant term and a trend,
$B$ is a matrix of coefficients of those deterministic regressors, $U_t$ is a $1 \times g$
vector of error terms, and the $\Phi_i$ are $g \times g$ matrices of coefficients.


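The VAR (14.59) is written in levels. It can be reparametrized as

$$\Delta Y_t = X_t B + Y_{t-1}\Pi + \sum_{i=1}^{p} \Delta Y_{t-i}\Gamma_i + U_t, \tag{14.60}$$

where it is not difficult to verify that $\Gamma_p = -\Phi_{p+1}$, $\Gamma_i = \Gamma_{i+1} - \Phi_{i+1}$ for
$i = 1, \ldots, p - 1$, and

$$\Pi = \sum_{i=1}^{p+1} \Phi_i - \mathbf{I}_g.$$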

Equation (14.60) is the multivariate analog of the augmented Dickey-Fuller
test regression (14.33). In that regression, we tested the null hypothesis of
a unit root by testing whether the coefficient of $y_{t-1}$ is 0. In very much the
same way, we can test whether and to what extent the variables in $Y_t$ are
cointegrated by testing hypotheses about the $g \times g$ matrix $\Pi$, which is called
the impact matrix.


If we assume, as usual, that the differenced variables are I(0), then everything
in equation (14.60) except the term $Y_{t-1}\Pi$ is I(0). Therefore, if the equation
is to be satisfied, this term must be I(0) as well. It clearly is so if the matrix
$\Pi$ is a zero matrix. In this extreme case, there is no cointegration at all.
However, it can also be I(0) if $\Pi$ is nonzero but does not have full rank. In
fact, the rank of $\Pi$ is the number of cointegrating relations.


To see why this is so, suppose that the matrix $\Pi$ has rank $r$, with $0 < r < g$.
In this case, we can always write

$$\Pi = \eta\alpha^\top, \tag{14.61}$$

where $\eta$ and $\alpha$ are both $g \times r$ matrices. Recall that the rank of a matrix
is the number of linearly independent columns. Here, any set of $r$ linearly
independent columns of $\Pi$ is a set of linear combinations of the $r$ columns
of $\eta$. See also Exercise 14.19. When equation (14.61) holds, we see that
$Y_{t-1}\Pi = Y_{t-1}\eta\alpha^\top$. This term is I(0) if and only if the $r$ columns of $Y_{t-1}\eta$
are I(0). Thus, for each of the $r$ columns $\eta_i$ of $\eta$, $Y_{t-1}\eta_i$ is I(0). In other
words, $\eta_i$ is a cointegrating vector. Since the $\eta_i$ are linearly independent, it
follows that there are $r$ independent cointegrating relations.


We can now see just how the number of cointegrating vectors is related to
the rank of the matrix $\Pi$. In the extreme case in which $r = 0$, there are
no cointegrating vectors at all, and $\Pi = \mathbf{O}$. When $r = 1$, there is a single
cointegrating vector, which is proportional to $\eta_1$. When $r = 2$, there is a
two-dimensional space of cointegrating vectors, spanned by $\eta_1$ and $\eta_2$. When
$r = 3$, there is a three-dimensional space of cointegrating vectors, spanned
by $\eta_1$, $\eta_2$, and $\eta_3$, and so on. Our assumptions exclude the case with $r = g$,
since we have assumed that all the elements of $Y_t$ are I(1). If $r = g$, every
linear combination of these elements would be stationary, which implies that
all the elements of $Y_t$ are I(0).


The system (14.60) with the constraint (14.61) imposed can be written as

$$\Delta Y_t = X_t B + Y_{t-1}\eta\alpha^\top + \sum_{i=1}^{p} \Delta Y_{t-i}\Gamma_i + U_t. \tag{14.62}$$

Estimating this system of equations yields estimates of the $r$ cointegrating
vectors. However, it can be seen from (14.62) that not all of the elements of
$\eta$ and $\alpha$ can be identified, since the factorization (14.61) is not unique for a
given $\Pi$. In fact, if $\Theta$ is any nonsingular $r \times r$ matrix,

$$\eta\Theta\Theta^{-1}\alpha^\top = \eta\alpha^\top. \tag{14.63}$$

It is therefore necessary to make some additional assumption in order to
convert equation (14.62) into an identified model.
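
The rank factorization (14.61) and the indeterminacy (14.63) are easy to illustrate numerically. The sketch below assumes NumPy; the dimensions, the random matrices, and the particular matrix `Theta` are arbitrary choices made for illustration. It builds a rank-$r$ impact matrix, recovers one valid factorization from the singular value decomposition, and shows that a rotated pair of factors reproduces the same matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
g, r = 4, 2  # illustrative dimensions (assumptions)

# Build an impact matrix of rank r from g x r factors, as in (14.61).
eta = rng.standard_normal((g, r))
alpha = rng.standard_normal((g, r))
Pi = eta @ alpha.T
rank = np.linalg.matrix_rank(Pi)

# One factorization recovered from Pi alone, via the SVD.
Umat, s, Vt = np.linalg.svd(Pi)
eta_svd = Umat[:, :r] * s[:r]   # g x r
alpha_svd = Vt[:r].T            # g x r
Pi_svd = eta_svd @ alpha_svd.T

# Non-uniqueness (14.63): rotate by any nonsingular r x r matrix Theta.
Theta = np.array([[2.0, 1.0],
                  [1.0, 1.0]])  # determinant 1, hence nonsingular
eta_star = eta @ Theta
alpha_star = alpha @ np.linalg.inv(Theta).T
Pi_star = eta_star @ alpha_star.T

print(rank, np.allclose(Pi, Pi_svd), np.allclose(Pi, Pi_star))
```

Because `eta_star` and `alpha_star` differ from `eta` and `alpha` but give the same product, the data can at best identify the space spanned by the columns of $\eta$ unless a normalization is imposed.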


We now consider the simpler case in which $g = 2$, $r = 1$, and $p = 0$. In this
case, the VAR (14.60) becomes

$$\begin{aligned}
\Delta y_{t1} &= X_t b_1 + \pi_{11}\, y_{t-1,1} + \pi_{21}\, y_{t-1,2} + u_{t1}, \\
\Delta y_{t2} &= X_t b_2 + \pi_{12}\, y_{t-1,1} + \pi_{22}\, y_{t-1,2} + u_{t2},
\end{aligned} \tag{14.64}$$

in obvious notation. If one forgets for a moment about the terms $X_t b_i$, this
pair of equations can be deduced from the model (14.36), with $\pi_{21} = \phi_{12}$,
$\pi_{12} = \phi_{21}$, and $\pi_{ii} = \phi_{ii} - 1$, $i = 1, 2$. We saw in connection with the system
(14.36) that, if $y_{t1}$ and $y_{t2}$ are cointegrated, then the matrix $\Phi$ of (14.37) has
one unit eigenvalue and the other eigenvalue less than 1 in absolute value.
This requirement is identical to requiring the matrix

$$\begin{bmatrix} \pi_{11} & \pi_{21} \\ \pi_{12} & \pi_{22} \end{bmatrix}$$

to have one zero eigenvalue and the other between $-2$ and 0. Let the zero
eigenvalue correspond to the eigenvector $[\eta_2 \;\vdots\; 1]$. Then it follows that

$$\pi_{21} = -\eta_2\pi_{11} \quad\text{and}\quad \pi_{22} = -\eta_2\pi_{12}.$$

Thus the pair of equations corresponding in this special case to the set of
equations (14.62), incorporating an identifying restriction, is

$$\begin{aligned}
\Delta y_{t1} &= X_t b_1 + \pi_{11}(y_{t-1,1} - \eta_2\, y_{t-1,2}) + u_{t1}, \\
\Delta y_{t2} &= X_t b_2 + \pi_{12}(y_{t-1,1} - \eta_2\, y_{t-1,2}) + u_{t2},
\end{aligned} \tag{14.65}$$

from which it is clear that the cointegrating vector is $[1 \;\vdots\; -\eta_2]$.


Unlike equations (14.64), the restricted equations (14.65) are nonlinear. There
are at least two convenient ways to estimate them. One is first to estimate
the unrestricted equations (14.64) and then use the GNR (12.53) discussed
in Section 12.3, possibly with continuous updating of the estimate of the
contemporaneous covariance matrix. Another is to use maximum likelihood,
under the assumption that the error terms $u_{t1}$ and $u_{t2}$ are jointly normally
distributed. This second method extends straightforwardly to the estimation
of the more general restricted VAR (14.62). The normality assumption is not
really restrictive, since the ML estimator is a QMLE even when the normality
assumption is not satisfied; see Section 10.4.


Maximum likelihood estimation of a system of nonlinear equations was treated
in Section 12.3. We saw there that one approach is to minimize the
determinant of the matrix of sums of squares and cross-products of the residuals.
The hard work can be restricted to the minimization with respect to $\eta_2$, since,
for fixed $\eta_2$, the regression functions in (14.65) are linear with respect to the
other parameters. As functions of $\eta_2$, then, the residuals can be written as
$M_{X,v}\Delta y_i$, where the $y_i$, for $i = 1, 2$, are $n$-vectors with typical elements $y_{ti}$,
and $v$ is an $n$-vector with typical element $y_{t-1,1} - \eta_2\, y_{t-1,2}$, for the given $\eta_2$.
Here $M_{X,v}$ denotes the orthogonal projection on to $\mathcal{S}^\perp([X \;\; v])$.


For simplicity, we suppose for the moment that $X$ is an empty matrix. The
general case will be dealt with in more detail in the next section. Then the
determinant that we wish to minimize with respect to $\eta_2$ is the determinant
of the matrix $\Delta Y^\top M_v \Delta Y$, where $\Delta Y \equiv [\Delta y_1 \;\; \Delta y_2]$. A certain amount
of algebra (see Exercise 14.22) shows that this determinant is equal to the
determinant of $\Delta Y^\top \Delta Y$ times the ratio

$$\kappa \equiv \frac{v^\top M_{\Delta Y}\, v}{v^\top v}. \tag{14.66}$$

Since $\Delta Y^\top \Delta Y$ depends only on the data and not on $\eta_2$, it is enough to
minimize $\kappa$ with respect to $\eta_2$. The notation $\kappa$ is intended to be reminiscent
of the notation used in Section 12.5 in the context of LIML estimation, since
the algebra of LIML is very similar to that used here. In the present simple
case, the first-order condition for minimizing $\kappa$ reduces to a quadratic equation
for $\eta_2$. Of the two roots of this equation, we select the one for which the value
of $\kappa$ given by equation (14.66) is smaller; see Exercise 14.23 for details.
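
The determinant identity and the minimization of $\kappa$ can be illustrated as follows. This is a minimal sketch, assuming NumPy, an empty $X$, and a simulated DGP chosen only for illustration (a random walk $y_{t2}$ and $y_{t1} = 2y_{t2}$ plus white noise). Instead of solving the quadratic equation mentioned in the text, it uses an equivalent route: $\kappa$ is a ratio of two quadratic forms in the vector $[1 \;\vdots\; -\eta_2]$, so its minimum is the smallest generalized eigenvalue of the two $2 \times 2$ matrices involved.

```python
import numpy as np

rng = np.random.default_rng(42)
n, eta2_true = 500, 2.0  # illustrative sample size and parameter (assumptions)

# Cointegrated pair: y2 is a random walk, and y1 - eta2 * y2 is white noise.
y2 = np.cumsum(rng.standard_normal(n + 1))
y1 = eta2_true * y2 + rng.standard_normal(n + 1)
Y = np.column_stack([y1, y2])
dY = np.diff(Y, axis=0)  # n x 2 matrix [dy_1, dy_2]
Ylag = Y[:-1]            # n x 2 matrix of lagged levels

def kappa(eta2):
    """The ratio (14.66) for a given value of eta2."""
    v = Ylag[:, 0] - eta2 * Ylag[:, 1]
    resid = v - dY @ np.linalg.lstsq(dY, v, rcond=None)[0]  # M_{dY} v
    return (resid @ resid) / (v @ v)

def det_restricted(eta2):
    """Determinant of dY' M_v dY for a given value of eta2."""
    v = Ylag[:, 0] - eta2 * Ylag[:, 1]
    resid = dY - np.outer(v, v @ dY) / (v @ v)              # M_v dY
    return np.linalg.det(resid.T @ resid)

# kappa = (w'Aw)/(w'Bw) with w = [1, -eta2]; minimize via a symmetric eigenproblem.
A = Ylag.T @ (Ylag - dY @ np.linalg.solve(dY.T @ dY, dY.T @ Ylag))
B = Ylag.T @ Ylag
Linv = np.linalg.inv(np.linalg.cholesky(B))  # B = L L'
C = Linv @ A @ Linv.T
vals, vecs = np.linalg.eigh((C + C.T) / 2)   # eigenvalues in ascending order
w = Linv.T @ vecs[:, 0]                      # minimizer of the ratio
eta2_hat, kappa_min = -w[1] / w[0], vals[0]

print(eta2_hat, kappa_min)
```

The function `det_restricted` evaluates the determinant directly, so comparing it with `np.linalg.det(dY.T @ dY) * kappa(eta2)` at any trial value checks the identity that Exercise 14.22 asks the reader to prove.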


As with the other methods we have discussed, estimating a cointegrating vec- 
tor by a VAR yields a super-consistent estimator. Bias is in general less than 
with either the levels estimator (14.46) or the ECM estimator obtained by 
running regression (14.52). For small sample sizes, there appears to be a ten- 
dency for there to be outliers in the left-hand tail of the distribution, leading 
to a higher variance than with the other two methods. This phenomenon 
apparently disappears for samples of size greater than about 100, however; 
see Exercise 14.24. 


14.6 Testing for Cointegration 


The three methods discussed in the last section for estimating a cointegrating 
vector can all be extended to provide tests for whether cointegrating relations 
exist for a set of I(1) variables, and, in the case in which a VAR is used, to 
determine how many such relations exist. We begin with a method based on 
the cointegrating regression (14.44). 


Engle-Granger Tests 


The simplest, and probably still the most popular, way to test for
cointegration was proposed by Engle and Granger (1987). The idea is to estimate
the cointegrating regression (14.44) by OLS and then subject the resulting
estimates of $v_t$ to a Dickey-Fuller test, which is usually augmented to deal
with serial correlation. We saw in the last section that, if the variables $Y_t$ are
cointegrated, then the OLS estimator of $\eta_2$ from equation (14.44) is
super-consistent. The residuals $\hat v_t$ are then super-consistent estimators of the
particular linear combination of the elements of $Y_t$ that is I(0). If, however, the
variables are not cointegrated, there is no such linear combination, and the
residuals, being a linear combination of I(1) variables, are themselves I(1).
Therefore, they have a unit root. Thus, when we subject the series $\hat v_t$ to a
unit root test, the null hypothesis of the test is that $v_t$ does have a unit root,
that is, that the variables in $Y_t$ are not cointegrated.


It may seem curious to have a null hypothesis of no cointegration, but this 
follows inevitably from the nature of any unit root test. Recall from the simple 
model (14.36) that, when there is no cointegration, the matrix $\Phi$ of (14.37)
is restricted so as to have two unit eigenvalues. The alternative hypothesis of 
cointegration implies that there is just one, the only constraint on the other 


Copyright © 1999, Russell Davidson and James G. MacKinnon 




eigenvalue being that its absolute value should be less than 1. It is therefore 
natural from this point of view to have a test with a null hypothesis of no 
cointegration, with the restriction that there are two unit roots, against an 
alternative of cointegration, with only one. This feature applies to all the 
tests for cointegration that we consider. 


The first step of the Engle-Granger procedure is to obtain the residuals $\hat v_t$
from regression (14.44). An augmented Engle-Granger (EG) test is then
performed in almost exactly the same way as an augmented Dickey-Fuller
test, by running the regression

$$\Delta\hat v_t = X_t\gamma + \beta'\hat v_{t-1} + \sum_{j=1}^{p} \delta_j \Delta\hat v_{t-j} + e_t, \tag{14.67}$$

where $p$ is chosen to remove any evidence of serial correlation in the residuals.
As with the ADF test, the test statistic may be either a $\tau$ statistic or a
$z$ statistic, although the former is more common. We let $\tau_c(g)$ denote the
t statistic for $\beta' = 0$ in (14.67) when $X_t$ contains only a constant term and
the vector $\eta_2$ has $g - 1$ elements to be estimated. Similarly, $\tau_{nc}(g)$, $\tau_{ct}(g)$, and
$\tau_{ctt}(g)$ denote t statistics for the same null hypothesis, where the indicated
deterministic terms are included in $X_t$. By the same token, $z_{nc}(g)$, $z_c(g)$,
$z_{ct}(g)$, and $z_{ctt}(g)$ denote the corresponding $z$ statistics. As before, these are
defined by equation (14.34).
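
The two steps of the procedure can be sketched in a few lines. The code below is an illustration only, assuming NumPy, a constant as the sole deterministic regressor in both steps, and a simulated cointegrated pair; the helper names `ols` and `eg_tau_c` are hypothetical. It returns the t statistic for $\beta' = 0$ in (14.67), which must be compared with Engle-Granger critical values for the appropriate $g$, not with standard normal or Dickey-Fuller ones.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, residuals, and conventional standard errors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, resid, se

def eg_tau_c(y1, Y2, p=1):
    """t statistic for beta' = 0 in (14.67), constant-only case."""
    n = len(y1)
    # Step 1: cointegrating regression of y1 on a constant and Y2.
    _, v, _ = ols(np.column_stack([np.ones(n), Y2]), y1)
    # Step 2: augmented test regression on the first-step residuals.
    dv = np.diff(v)                      # dv[k] = v[k + 1] - v[k]
    lhs = dv[p:]
    cols = [np.ones(len(lhs)), v[p:-1]]  # constant and lagged residual
    cols += [dv[p - j:len(dv) - j] for j in range(1, p + 1)]  # lagged differences
    beta, _, se = ols(np.column_stack(cols), lhs)
    return beta[1] / se[1]

# Simulated example (assumed DGP): y1 and y2 are cointegrated.
rng = np.random.default_rng(3)
n = 500
y2 = np.cumsum(rng.standard_normal(n))        # I(1) regressor
y1 = 1.0 + 2.0 * y2 + rng.standard_normal(n)  # y1 - 2 * y2 is I(0)
tau = eg_tau_c(y1, y2, p=1)
print(tau)
```

With this DGP the residuals are close to white noise, so the statistic is strongly negative and the null of no cointegration would be rejected at any conventional level.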


As the above notation suggests, the asymptotic distributions of these test
statistics depend on $g$. When $g = 1$, we have a limiting case, since there is then
only one variable, $y_t$, which is I(1) under the null hypothesis and I(0) under the
alternative. Not surprisingly, for $g = 1$, the asymptotic distribution of each of
the Engle-Granger statistics is identical to the asymptotic distribution of the
corresponding Dickey-Fuller statistic. To see this, note that the residuals $\hat v_t$
are in this case just $y_t$ itself projected off whatever is in $X_t$. The result then
follows from the FWL Theorem, which implies that regressing $y_t$ on $X_t$ and
then running regression (14.67) is the same (except for the initial observations)
as directly running an ADF testing regression like (14.32). If there is more
than one variable, but some or all of the components of the cointegrating
vector are known, then the proper value of $g$ is 1 plus the number of parameters
to be estimated in order to estimate $\eta_2$. Thus, if all the parameters are known,
we have $g = 1$ whatever the number of variables.


Figure 14.4 shows the asymptotic densities of the $\tau_c(g)$ tests for $g = 1, \ldots, 12$.
The densities move steadily to the left as $g$, the number of possibly
cointegrated variables, increases. In consequence, the critical values become larger
in absolute value, and the power of the test diminishes. The other
Engle-Granger tests display similar patterns.


Figure 14.4 Asymptotic densities of Engle-Granger $\tau_c$ tests


Since a set of $g$ I(1) variables is cointegrated if there is a linear combination
of them that is I(0), any $g$ independent linear combinations of the variables
is also a cointegrated set. In other words, cointegration is a property of the
linear space spanned by the variables, not of the particular choice of variables
that span the space. A problem with Engle-Granger test statistics is that they
depend on the particular choice of $Y_{t2}$ in the first step regression (14.44), or,
more precisely, on the linear subspace spanned by the variables in $Y_{t2}$. The
asymptotic distribution of the test statistic under the null hypothesis is the
same regardless of how $Y_{t2}$ is chosen, but the actual test statistic is not.
Consequently, Engle-Granger tests with the same data but different choices
of $Y_{t2}$ can, and often do, lead to quite different inferences.


ECM Tests 


A second way to test for cointegration involves the estimation of an
error-correction model. We can base an ECM test for the null hypothesis that the
set of variables $Y_t \equiv [y_t \;\; Y_{t2}]$ is not cointegrated on equation (14.53). If no
linear combination of the variables in $Y_t$ is I(0), then the coefficients $\delta$ and $\delta_2$
in that equation must be zero. A suitable test statistic is thus the t statistic
for $\delta = 0$. Of course, since the regressor $y_{t-1}$ is I(1), this ECM statistic
does not follow the N(0,1) distribution asymptotically. Instead, if $Y_t$ is a
$1 \times g$ vector, it follows the distribution that Ericsson and MacKinnon (2002)
call the $\kappa_d(g)$ distribution, where $d$ is one of nc, c, ct, or ctt, depending on
which deterministic regressors are included in $X_t$.


Figure 14.5 Asymptotic densities of ECM $\kappa_c$ tests


When $g = 1$, the asymptotic distribution of the ECM statistic is identical to
that of the corresponding Dickey-Fuller $\tau$ statistic. This follows immediately
from the fact that, for $g = 1$, equation (14.53) collapses to

$$\Delta y_t = X_t\gamma + \delta y_{t-1} + e_t,$$

which is equivalent to equation (14.31). However, when $g > 1$, the
distributions of the various $\kappa$ statistics are not the same as those of the corresponding
Engle-Granger $\tau$ statistics.


Equation (14.53) is less likely to suffer from serial correlation than the Engle- 
Granger test regression (14.67) because the error-correction term often has 
considerable explanatory power when there really is cointegration. If serial 
correlation is a problem, one can add lagged values of both $\Delta y_t$ and $\Delta Y_{t2}$
to equation (14.53) without affecting the asymptotic distributions of the test 
statistics. Indeed, one can add any stochastic variable that is I(0) and exogen- 
ous or predetermined, as well as nontrending deterministic variables. Thus 
it is possible to perform ECM tests within the context of a well-specified 
econometric model, of which equation (14.53) is a special case. Indeed, this is 
probably the best way to perform such a test, and it is one of the things that 
makes ECM tests attractive. 


Figure 14.5 shows the densities of the $\kappa_c(g)$ statistics for $g = 1, \ldots, 12$. This
figure is comparable to Figure 14.4. It can be seen that, for $g > 1$, the
critical values are somewhat smaller in absolute value than they are for the
corresponding EG tests. The distributions of the $\kappa$ statistics are also more
spread out than those of the corresponding $\tau$ statistics, with positive values
much more likely to occur.

Under the alternative hypothesis of cointegration, an ECM test is more likely
to reject the false null than an EG test. Consider equation (14.52).
Subtracting $\eta_2\Delta y_{t2}$ from both sides and rearranging, we obtain

$$\Delta(y_{t1} - \eta_2\, y_{t2}) = \delta_1(y_{t-1,1} - \eta_2\, y_{t-1,2}) + (\alpha - \eta_2)\Delta y_{t2} + e_t. \tag{14.68}$$

If we replace $\eta_2$ by its estimate (14.46) and omit the term $(\alpha - \eta_2)\Delta y_{t2}$, this
is just a version of the Engle-Granger test regression (14.67). We remarked
in our discussion of the estimation of $\eta_2$ by an ECM that the restriction that
$\alpha = \eta_2$ is often too strong for comfort. When this restriction is false, we may
expect (14.68) to fit better than (14.67) and to be less likely to suffer from
serially correlated errors. Thus we should expect the EG test to have less
power than the ECM test in most cases. It must be noted, however, that the
ECM test shares with the EG test the disadvantage that it depends on the
particular choice of $Y_{t2}$.


For more detailed discussions of ECM tests, see Campos, Ericsson, and Hendry 
(1996), Banerjee, Dolado, and Mestre (1998), and Ericsson and MacKinnon 
(2002). The densities graphed in Figure 14.5 are taken from the last of these 
papers, which provides programs that can be used to compute critical values 
and P values for these tests. 


Tests Based on a Vector Autoregression 


A third way to test for cointegration is based on the VAR (14.60). The idea 
is to estimate this VAR subject to the constraint (14.61) for various values 
of the rank $r$ of the impact matrix $\Pi$, using ML estimation based on the
assumption that the error vector $U_t$ is multivariate normal for each $t$ and
independent across observations. Null hypotheses for which there are any 
number of cointegrating relations from 0 to g — 1 can then be tested against 
alternatives with a greater number of relations, up to a maximum of g. Of 
course, if there really were g cointegrating relations, all the variables would 
be I(0), and so this case is usually only of theoretical interest. The most 
convenient test statistics are likelihood ratio (LR) statistics. 


We saw in the last section that a convenient way to obtain ML estimates of 
the restricted VAR (14.62) is to minimize the determinant of the matrix of 
sums of squares and cross-products of the residuals. We now describe how to 
do this in general, and how to use the result in order to compute estimates 
of sets of cointegrating vectors and LR test statistics. We will not enter into 
a discussion of why the recipes we provide work, since doing so would be 
rather complicated. But, since the methodology is in very common use in 
practice, we will give detailed instructions as to how it can be implemented. 
See Banerjee, Dolado, Galbraith, and Hendry (1993, Chapter 8), Davidson 
(2000, Chapter 16), and Johansen (1995) for more detailed treatments. 


The first step is to concentrate the $B$ and $\Gamma_i$ parameters out of the VAR
(14.62). We can do this by regressing both $\Delta Y_t$ and $Y_{t-1}$ on the deterministic
variables $X_t$ and the lags $\Delta Y_{t-1}$ through $\Delta Y_{t-p}$. This requires us to run
$2g$ OLS regressions, all of which involve the same regressors, and yields two
sets of residuals,

$$\begin{aligned}
\hat V_{t1} &= \Delta Y_t - \widehat{\Delta Y}_t, \quad\text{and} \\
\hat V_{t2} &= Y_{t-1} - \hat Y_{t-1},
\end{aligned} \tag{14.69}$$

where $\hat V_{t1}$ and $\hat V_{t2}$ are both $1 \times g$ vectors. In equations (14.69), $\widehat{\Delta Y}_t$ and $\hat Y_{t-1}$
denote the fitted values from the regressions on $X_t$ and $\Delta Y_{t-1}$ through $\Delta Y_{t-p}$.

The next step is to compute the $g \times g$ sample covariance matrices

$$\hat\Sigma_{jl} = \frac{1}{n}\sum_{t=1}^{n} \hat V_{tj}^\top \hat V_{tl}, \quad j = 1, 2, \;\; l = 1, 2.$$

Then we must find the solutions $\lambda_i$ and $z_i$, for $i = 1, \ldots, g$, to the equations

$$\bigl(\lambda_i\hat\Sigma_{22} - \hat\Sigma_{21}\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\bigr) z_i = 0, \tag{14.70}$$

which are similar to equations (12.115) for finding the eigenvalues and
eigenvectors of a matrix. The eigenvalue-eigenvector problem we actually solve is
for the positive definite symmetric matrix

$$A \equiv \hat\Psi_{22}^\top \hat\Sigma_{21}\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat\Psi_{22}, \tag{14.71}$$

where $\hat\Psi_{22}\hat\Psi_{22}^\top = \hat\Sigma_{22}^{-1}$. The eigenvalues of this matrix turn out to be the $\lambda_i$
that we seek. We sort these from largest to smallest, so that $\lambda_i > \lambda_j$ for $i < j$.
Then we choose the corresponding eigenvectors to be the columns of a $g \times g$
matrix $W$ which is such that $W^\top W = \mathbf{I}$; see Exercise 14.25. The
eigenvalue-eigenvector relation implies that $AW = W\Lambda$, where the diagonal entries of
the diagonal matrix $\Lambda$ are the (ordered) eigenvalues $\lambda_i$. It is then easy to show
that the columns $z_i$ of the matrix $Z \equiv \hat\Psi_{22}W$ solve the equations (14.70) along
with the $\lambda_i$, and that the matrix $Z$ satisfies the relation

$$Z^\top\hat\Sigma_{22}Z = \mathbf{I}. \tag{14.72}$$


The purpose of solving equations (14.70) in this way is that the first $r$ columns
of $Z$ are the ML estimates $\hat\eta$ of $\eta$, with equations (14.72) providing the
necessary identifying restrictions so that $\alpha$ and $\eta$ are uniquely determined; recall
the indeterminacy expressed by equation (14.63). As we remarked in the last
section, once $\eta$ is given, the equations (14.62) are linear in the other
parameters, which can therefore be estimated by least squares.
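
The recipe just described can be sketched as follows. This is an illustration under stated assumptions, not a full implementation: it uses NumPy only, takes a constant as the single deterministic regressor with no restrictions on $B$, and obtains a matrix $\hat\Psi_{22}$ satisfying $\hat\Psi_{22}\hat\Psi_{22}^\top = \hat\Sigma_{22}^{-1}$ from a Cholesky factor, which is one convenient choice among many. The function name `johansen_eig` and the simulated data are hypothetical.

```python
import numpy as np

def johansen_eig(Y, p=1):
    """Eigenvalues lambda_i and matrix Z from (14.70)-(14.72).

    Y is n x g in levels; p is the number of lagged differences.
    Assumes a constant is the only deterministic regressor.
    """
    dY = np.diff(Y, axis=0)
    T = dY.shape[0] - p                    # usable observations
    X = np.hstack([np.ones((T, 1))] +
                  [dY[p - i:dY.shape[0] - i] for i in range(1, p + 1)])

    def resid(M):
        # Residuals from regressing each column of M on X.
        return M - X @ np.linalg.lstsq(X, M, rcond=None)[0]

    V1 = resid(dY[p:])                     # residuals for Delta Y_t
    V2 = resid(Y[p:-1])                    # residuals for Y_{t-1}
    S11, S12, S22 = V1.T @ V1 / T, V1.T @ V2 / T, V2.T @ V2 / T

    # Psi Psi' = inv(S22): with S22 = L L', take Psi = inv(L)'.
    Psi = np.linalg.inv(np.linalg.cholesky(S22)).T
    A = Psi.T @ S12.T @ np.linalg.solve(S11, S12) @ Psi
    lam, W = np.linalg.eigh((A + A.T) / 2)
    order = np.argsort(lam)[::-1]          # largest eigenvalue first
    lam, W = lam[order], W[:, order]
    return lam, Psi @ W, S22

# Simulated example (assumed DGP): one cointegrating vector proportional to [1, -2].
rng = np.random.default_rng(7)
n = 400
y2 = np.cumsum(rng.standard_normal(n))
y1 = 2.0 * y2 + rng.standard_normal(n)
lam, Z, S22 = johansen_eig(np.column_stack([y1, y2]), p=1)
print(lam, -Z[1, 0] / Z[0, 0])
```

With $r = 1$, the first column of `Z` estimates the cointegrating vector up to scale, so the ratio printed at the end should be close to the value 2 used in the simulation.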


It can be shown that the maximized loglikelihood function for the restricted
model (14.62) is

$$-\frac{gn}{2}(\log 2\pi + 1) - \frac{n}{2}\sum_{i=1}^{r}\log(1 - \lambda_i). \tag{14.73}$$

Thus we can calculate the maximized loglikelihood function for any value of
the number of cointegrating vectors, once we have found the eigenvalues of
the matrix (14.71). For given $r$, (14.73) depends on the $r$ largest eigenvalues.
Note that it must be the case that $0 < \lambda_i < 1$ for all $i$, because the matrix $A$
is positive definite, and because, if $\lambda_i \geq 1$, the loglikelihood function (14.73)
would not exist.


As $r$ increases, so does the value of the maximized loglikelihood function
given by expression (14.73). This makes sense, since we are imposing fewer
restrictions. To test the null hypothesis that $r = r_1$ against the alternative
that $r = r_2$, for $r_1 < r_2 \leq g$, we compute the LR statistic

$$-n\sum_{i=r_1+1}^{r_2}\log(1 - \lambda_i). \tag{14.74}$$

This is often called the trace statistic, because it can be thought of as the sum
of a subset of the elements on the principal diagonal of the diagonal matrix
$-n\log(\mathbf{I} - \Lambda)$. Because the impact matrix $\Pi$ cannot be written as a matrix
of coefficients of I(0) variables (recall the discussion in the last section), the
distributions of the trace statistic are nonstandard. These distributions have
been tabulated for a number of values of $r_2 - r_1$. Typically, the trace statistic
is used to test the null hypothesis that there are $r$ cointegrating vectors against
the alternative that there are $g$ of them.


When the null hypothesis is that there are $r$ cointegrating vectors and the
alternative is that there are $r + 1$ of them, there is just one term in the sum
that appears in expression (14.74). The test statistic is then

$$-n\log(1 - \lambda_{r+1}) = -n\log(1 - \lambda_{\max}), \tag{14.75}$$

where $\lambda_{\max}$ is the largest eigenvalue of those that correspond to eigenvectors
which have not been incorporated into $\hat\eta$ under the null hypothesis. For
obvious reasons, this test statistic is often called the $\lambda_{\max}$ statistic. The
distributions of this statistic for various values of $r$ have been tabulated.
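
Given the ordered eigenvalues, both statistics are one-line computations. The sketch below assumes NumPy; the function names and the eigenvalues used in the example are made up for illustration. It does not supply critical values, which are nonstandard and depend on $g - r$ and on the deterministic terms.

```python
import numpy as np

def trace_stat(lam, n, r1, r2=None):
    """LR statistic (14.74) for H0: r = r1 against r = r2 (default r2 = g)."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]  # largest first
    r2 = len(lam) if r2 is None else r2
    return -n * np.sum(np.log(1.0 - lam[r1:r2]))

def lambda_max_stat(lam, n, r):
    """LR statistic (14.75) for H0: r against r + 1 cointegrating vectors."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]
    return -n * np.log(1.0 - lam[r])

# Hypothetical eigenvalues for a system with g = 3 and n = 200 observations.
lam_example = [0.30, 0.10, 0.02]
n_obs = 200
for r in range(3):
    print(r, trace_stat(lam_example, n_obs, r), lambda_max_stat(lam_example, n_obs, r))
```

The trace statistic for $r$ against $g$ is the sum of the $\lambda_{\max}$ statistics for $r, r+1, \ldots, g-1$, and the two coincide when $g - r = 1$.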


Like those of unit-root tests and single-equation cointegration tests, the
asymptotic distributions of the trace and $\lambda_{\max}$ statistics depend on what
deterministic regressors are included in $X_t$. To complicate matters, it may
well be desirable to impose restrictions on the matrix $B$, and the distributions
also depend on what restrictions, if any, are imposed.


A further complication is that some of the I(1) variables may be known not 
to be cointegrated. In that case, we can divide $Y_t$ into two parts, treating the
variables in one part as exogenous and those in the other part as potentially 
cointegrated. The distributions of the test statistics then depend on how many 
exogenous variables there are. For details, see Harbo, Johansen, Nielsen, and 
Rahbek (1998) and Pesaran, Shin, and Smith (2000). 

Figure 14.6 Asymptotic densities of some $\lambda_{\max}$ tests


Figure 14.6 shows the densities of the $\lambda_{\max}$ statistics for the null hypotheses
that $r = 0, 1, 2, 3, 4, 5$ under one popular assumption about $B$, namely, that
$X_t$ consists only of a constant, and that there are certain restrictions on $B$.
This was called “Case II” by Pesaran, Shin, and Smith (2000) and “Case 1*”
by Osterwald-Lenum (1992). We see from the figure that the mean and
variance of the $\lambda_{\max}$ statistic become larger as $r$ increases, and that its density
becomes more symmetrical. The mean and variance of the trace statistic,
which coincides with the $\lambda_{\max}$ statistic when $g - r = 1$, increase even more
rapidly as $g - r$ increases. Figure 14.6 is based on results from MacKinnon,
Haug, and Michelis (1999), which provides programs that can be used to
compute asymptotic critical values and P values for the $\lambda_{\max}$ and trace statistics
for all the standard cases, including systems with exogenous I(1) variables.


Unlike EG and ECM tests, tests based on the trace or $\lambda_{\max}$ statistics are
invariant when the variables $Y_t$ are replaced by independent linear
combinations of them. We will not take the time to prove this important property,
but it is a reasonably straightforward consequence of the definitions given in 
this section. Intuitively, it is a consequence of the fact that no particular 
variable or linear combination of variables is singled out in the specification 
of the VAR (14.62), in contrast to the specifications of the regressions used 
to implement EG and ECM tests. 




14.7 Final Remarks 


This chapter has provided a reasonably brief introduction to the modeling 
of nonstationary time series, a topic which has engendered a massive liter- 
ature in a relatively short period of time. A deeper treatment would have 
required a book instead of a chapter. The asymptotic theory that is applica- 
ble when some variables have unit roots is very different from the conventional 
asymptotic theory that we have encountered in previous chapters. Moreover, 
the enormous number of different tests, each with its own nonstandard limit- 
ing distribution, can be intimidating. However, we have seen that the same 
fundamental ideas underlie many of the techniques for both estimation and 
hypothesis testing in models that involve variables which have unit roots. 


14.8 Exercises 


14.1 Calculate the autocovariance $E(w_t w_s)$, $s < t$, of the standardized random
walk given by (14.01).

14.2 Suppose that $(1 - \rho(L))u_t = \varepsilon_t$ is the autoregressive representation of the
series $u_t$, where $\varepsilon_t$ is white noise, and $\rho(z)$ is a polynomial of degree $p$ with
no constant term. If $u_t$ has exactly one unit root, show that the polynomial
$1 - \rho(z)$ can be factorized as

$$1 - \rho(z) = (1 - z)\bigl(1 - \rho_0(z)\bigr),$$

where $1 - \rho_0(z)$ is a polynomial of degree $p - 1$ with no constant term and all its
roots strictly outside the unit circle. Give the autoregressive representation of
the first-differenced series $(1 - L)u_t$, and show that it implies that this series
is stationary.


14.3 Establish the three results

$$\sum_{t=1}^{n} t = \tfrac{1}{2}n(n+1), \qquad \sum_{t=1}^{n} t^2 = \tfrac{1}{6}n(n+1)(2n+1), \qquad \sum_{t=1}^{n} t^3 = \tfrac{1}{4}n^2(n+1)^2$$

by inductive arguments. That is, show directly that the results are true for
$n = 1$, and then for each one show that, if the result is true for a given $n$, it
is also true for $n + 1$.


14.4 Consider the following random walk, in which a second-order polynomial in
time is included in the defining equation:

$$y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + y_{t-1} + u_t, \quad u_t \sim \mathrm{IID}(0, \sigma^2).$$

Show that $y_t$ can be generated in terms of a standardized random walk $w_t$
that satisfies (14.01) by the equation

$$y_t = y_0 + \beta_0 t + \beta_1 \tfrac{1}{2}t(t+1) + \beta_2 \tfrac{1}{6}t(t+1)(2t+1) + \sigma w_t.$$

Can you obtain a similar result for the case in which the second-order
polynomial is replaced by a polynomial of degree $p$ in time?

14.5 For sample sizes of 50, 100, 200, 400, and 800, generate $N$ pairs of data from
the DGP

$$\begin{aligned}
y_t &= \rho_1 y_{t-1} + u_{t1}, \quad y_0 = 0, \quad u_{t1} \sim \mathrm{NID}(0, 1), \\
x_t &= \rho_2 x_{t-1} + u_{t2}, \quad x_0 = 0, \quad u_{t2} \sim \mathrm{NID}(0, 1),
\end{aligned}$$

for the following values of $\rho_1$ and $\rho_2$: $-0.7$, 0.0, 0.7, and 1. Then run regression
(14.12) and record the proportion of the time that the ordinary t test for
$\beta_2 = 0$ rejects the null hypothesis at the .05 level. Thus you need to perform
16 experiments for each of 5 sample sizes. Choose a reasonably large value
of $N$, but not so large that you use an unreasonable amount of computer time.
The smallest value that would probably make sense is $N = 10{,}000$.

For which values of $\rho_1$ and $\rho_2$ does it seem plausible that the t test based
on the spurious regression (14.12) rejects the correct proportion of the time
asymptotically? For which values is it clear that the test overrejects
asymptotically? Are there any values for which it appears that the test underrejects
asymptotically?

Was it really necessary to run all 16 experiments? Explain.


14.6 Repeat the previous exercise using regression (14.13) instead of regression
(14.12). For which values of $\rho_1$ and $\rho_2$ does it seem plausible that the t test
based on this regression rejects the correct proportion of the time
asymptotically? For which values is it clear that the test overrejects asymptotically?
Are there any values for which it appears that the test underrejects
asymptotically?


14.7 Repeat some of the experiments in Exercise 14.5 with $\rho_1 = \rho_2 = 0.8$, using
a HAC covariance matrix estimator instead of the OLS covariance matrix
estimator for the computation of the t statistic. A reasonable rule of thumb
is to set the lag truncation parameter $p$ equal to three times the fourth root
of the sample size, rounded to the nearest integer. You should also do a few
experiments with sample sizes between 1,000 and 5,000 in order to see how
slowly the behavior of the t test approaches its nominal asymptotic behavior.


14.8 Repeat Exercise 14.7 with unit root processes in place of stationary AR(1)
processes. You should find that the use of a HAC estimator alleviates the extent
of spurious regression, in the sense that the probability of rejection tends to 1
more slowly as $n \to \infty$. Intuitively, why should using a HAC estimator work,
even if only in very large samples, with stationary AR(1) processes but not
with unit root processes?


14.9 The HAC estimators used in the preceding two exercises are estimates of the
covariance matrix

$$(X^\top X)^{-1} X^\top \Omega X (X^\top X)^{-1}, \tag{14.76}$$

where $\Omega$ is the true covariance matrix of the error terms. Do just a few
experiments for sample sizes of 20, 40, and 60, with AR(1) variables in some
and unit root variables in others, in which you use the true $\Omega$ in (14.76) rather
than using a HAC estimator. Hint: The result of Exercise 7.10 is useful for
the construction of $X^\top \Omega X$. You should find that the rejection rate is very
close to nominal even for these small samples.
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14.10 


14.11 


14.12 


14.13 


14.14 


Unit Roots and Cointegration 


Consider the model with typical DGP 
P . 
y =X Bit +1 toen e~ ID(0,1). (14.77) 
i=0 


Show that the z and 7 statistics from the testing regression 


p+1 
Ayi = So vt! + (8 = 1)yt—1 + et 
i=0 


are pivotal if the DGP is (14.77) and the distribution for the white-noise 
process ¢€; is known. 


Show that 
n n n t—-l 
Sow? =Y (n-t+1)e +25 Y m-tt eres, 
t=1 t=1 t=2 s=1 


where wz is the standardized random walk (14.02). Demonstrate that any 
pair of terms from either sum on the right-hand side of the above expression 
are uncorrelated. Let the fourth moment of the white-noise process €4 be m4. 
Then show that the variance of Yai w2 is equal to 


Minnt 1)(2n +1) + $n? (n? — 1), 


of order n* as n — oo. Hint: Use the results of Exercise 14.3. 


Consider the standardized Wiener process W (r) defined by (14.26). Show 
that, for 0 < rı < rə < r3 < r4 < 1, W (r2) — W (r1) and W (r4) — W (r3) are 
independent. This property is called the property of independent increments 
of the Wiener process. Show that the covariance of W (r) and W (s) is equal 
to min(r, s). 

The process G(r), r € [0,1], defined by G(r) = W (r) — rW (1), where W (r) 
is a standardized Wiener process, is called a Brownian bridge. Show that 
G(r) ~ N(0,r(1—1)), and that the covariance of G(r) and G(s) is s(1 — r) 
forr > s. 


By using arguments similar to those leading to the result (14.29), demonstrate 
the result (14.30). For this purpose, the result of Exercise 4.8 may be helpful. 


Show that, if wz is the standardized random walk (14.01), ar wt is of 
order n°/? as n — oo. By use of the definition (14.28) of the Riemann 
integral, show that 


Demonstrate that this plim is distributed as $N(0, 1/3)$.


Show that the probability limit of the formula (14.20) for the statistic $z_c$ can
be written in terms of a standardized Wiener process $W(r)$ as

$$\operatorname*{plim}_{n\to\infty} z_c = \frac{\tfrac12\bigl(W^2(1) - 1\bigr) - W(1)\int_0^1 W(r)\,dr}{\int_0^1 W^2(r)\,dr - \bigl(\int_0^1 W(r)\,dr\bigr)^2}.$$


14.15 The file intrates-m.data contains several monthly interest rate series for the
United States from 1955 to 2001. Let $R_t$ denote the 10-year government bond
rate. Using data for 1957 through 2001, test the hypothesis that this series
has a unit root with ADF $\tau_c$, $\tau_{ct}$, $\tau_{ctt}$, $z_c$, $z_{ct}$, and $z_{ctt}$ tests, using whatever
value(s) of $p$ seem reasonable.


14.16 Consider the simplest ADF testing regression

$$\Delta y_t = \beta' y_{t-1} + \delta_1 \Delta y_{t-1} + e_t,$$

and suppose that the data are generated by the simplest random walk:
$y_t = w_t$, where $w_t$ is the standardized random walk (14.01). If $P_1$ is the
orthogonal projection on to the lagged dependent variable $\Delta y_{t-1}$, and if $w_{-1}$
is the $n$-vector with typical element $w_{t-1}$, show that the expressions

$$\frac{1}{n} \sum_{t=1}^{n} (P_1 w_{-1})_t\, \varepsilon_t \qquad \text{and} \qquad \frac{1}{n} \sum_{t=1}^{n} w_{t-1}\, \varepsilon_t$$

have the same probability limit as $n \to \infty$. Derive the same result for the two
expressions

$$\frac{1}{n^2} \sum_{t=1}^{n} (P_1 w_{-1})_t^2 \qquad \text{and} \qquad \frac{1}{n^2} \sum_{t=1}^{n} w_{t-1}^2.$$


14.17 Let the $p \times p$ matrix $A$ have $q$ distinct eigenvalues $\lambda_1, \ldots, \lambda_q$, where $q < p$.
Let the $p$-vectors $x_i$, $i = 1, \ldots, q$, be corresponding eigenvectors, so that
$A x_i = \lambda_i x_i$. Prove that the $x_i$ are linearly independent.


14.18 Show that the expression $n^{-1} \sum_{t=1}^{n} v_{t1} v_{t2}$, where $v_{t1}$ and $v_{t2}$ are given
by (14.41), has an expectation and a variance which both tend to finite limits
as $n \to \infty$. For the variance, the easiest way to proceed is to express the $v_{ti}$ as
in (14.41), and to count the number of nonzero contributions to the variance.


14.19 If the $p \times q$ matrix $A$ has rank $r$, where $r < p$ and $r < q$, show that there
exist a $p \times r$ matrix $B$ and a $q \times r$ matrix $C$, both of full column rank $r$, such
that $A = BC^\top$. Show further that any matrix of the form $BC^\top$, where $B$ is
$p \times r$ with $r < p$ and $C$ is $q \times r$ with $r < q$, has rank $r$ if both $B$ and $C$ have
rank $r$.
14.20 Generate two I(1) series $y_1$ and $y_2$ using the DGP given by (14.45) with
$x_{11} = x_{21} = 1$, $x_{12} = 0.5$, and $x_{22} = 0.3$. The series $v_{t1}$ and $v_{t2}$ should be
generated by (14.41), with $\lambda_1 = 1$ and $\lambda_2 = 0.7$, the series $e_{t1}$ and $e_{t2}$ being
white noise with a contemporaneous covariance matrix

$$\Sigma \equiv \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1.5 \end{bmatrix}.$$

Perform a set of simulation experiments for sample sizes $n = 30$, 50, 100,
200, and 500 in which the parameter $\eta_2$ of the stationary linear combination
$y_{t1} - \eta_2 y_{t2}$ is estimated first by (14.46), and then as $-\hat\delta_2/\hat\delta_1$ from the
regression (14.52). You should observe that the first estimator is substantially more
biased than the second.

Verify the super-consistency of both estimators by computing the first two
moments of $n(\hat\eta_2 - \eta_2)$ and showing that they are roughly constant as $n$ varies,
at least for larger values of $n$.


14.21 Show that, if

$$\frac{n^{1/2}(\hat\delta_i - \delta_i)}{\delta_i} = t + O_p(n^{-1/2}),$$

for $i = 1, 2$, then the ratio $\hat\delta_2/\hat\delta_1$ is super-consistent. In other words, show
that equation (14.56) holds.


14.22 Let $A \equiv [a_1 \;\; a_2]$ be an $n \times 2$ matrix, and let $v$ be an $n$-vector. Show that the
determinant of the $2 \times 2$ matrix $A^\top M_v A$, where $M_v$ projects orthogonally on
to $\mathcal{S}^\perp(v)$, is equal to the determinant of $A^\top A$ multiplied by $v^\top M_A v / v^\top v$. In
your calculation, it is helpful to exploit the fact that $\mathcal{S}(v)$ is one-dimensional
and to compute the explicit inverse of $A^\top A$ in terms of the scalar products
$a_i^\top a_j$, $i, j = 1, 2$.


14.23 Show that the first-order condition for minimizing the $\kappa$ given in expression
(14.66) with respect to $\eta_2$, where $v = y_1 - \eta_2 y_2$, is equivalent to requiring
that $\eta_2$ should be a solution to the quadratic equation

$$\begin{aligned}
&\eta_2^2 \bigl(y_1^\top y_2\; y_2^\top M_{\Delta Y} y_2 - y_1^\top M_{\Delta Y} y_2\; y_2^\top y_2\bigr) \\
&\quad + \eta_2 \bigl(y_2^\top y_2\; y_1^\top M_{\Delta Y} y_1 - y_1^\top y_1\; y_2^\top M_{\Delta Y} y_2\bigr) \\
&\quad + \bigl(y_1^\top y_1\; y_1^\top M_{\Delta Y} y_2 - y_1^\top M_{\Delta Y} y_1\; y_1^\top y_2\bigr) = 0.
\end{aligned} \tag{14.78}$$


14.24 Repeat the simulation experiments of Exercise 14.20 for the VAR estimator
of the parameter $\eta_2$ of the cointegration relation. The easiest way to proceed
is to solve the quadratic equation (14.78), choosing the root for which $\kappa$ is
smallest.


14.25 Let the $p \times p$ matrix $A$ be symmetric, and suppose that $A$ has two distinct
eigenvalues $\lambda_1$ and $\lambda_2$, with corresponding eigenvectors $z_1$ and $z_2$. Prove that
$z_1$ and $z_2$ are orthogonal.

Use this result to show that there is a $p \times p$ matrix $Z$, with $Z^\top Z = I$ (that
is, $Z$ is an orthogonal matrix), such that $AZ = Z\Lambda$, where $\Lambda$ is a diagonal
matrix the entries of which are the eigenvalues of $A$.


14.26 Let $r_t$ denote the logarithm of the 10-year government bond rate, and let $s_t$
denote the logarithm of the 1-year government bond rate, where monthly data
on both rates are available in the file intrates-m.data. Using data for 1957
through 2001, use whatever augmented Engle-Granger $\tau$ tests seem appropriate
to test the null hypothesis that these two series are not cointegrated.


14.27 Consider once again the Canadian consumption data in the file
consumption.data, for the period 1953:1 to 1996:4. Perform a variety of appropriate
tests of the hypotheses that the levels of consumption and income have unit
roots. Repeat the exercise for the logs of these variables.

If you fail to reject the hypotheses that the levels or the logs of these variables
have unit roots, proceed to test whether they are cointegrated, using two
versions of the EG test procedure, one with consumption, the other with income,
as the regressand in the cointegrating regression. Similarly, perform two
versions of the ECM test. Finally, test the null hypothesis of no cointegration
using Johansen’s VAR-based procedure.


Chapter 15


Testing the Specification 
of Econometric Models 


15.1 Introduction 


As we first saw in Section 3.7, estimating a misspecified regression model 
generally yields biased and inconsistent parameter estimates. This is true for 
regression models whenever we incorrectly omit one or more regressors that 
are correlated with the regressors included in the model. Except in certain 
special cases, some of which we have discussed, it is also true for more general 
types of model and more general types of misspecification. This suggests 
that the specification of every econometric model should be thoroughly tested 
before we even tentatively accept its results. 


We have already discussed a large number of procedures that can be used 
as specification tests. These include t and F tests for omitted variables and 
for parameter constancy (Section 4.4), along with similar tests for nonlinear 
regression models (Section 6.7) and IV regression (Section 8.5), tests for het- 
eroskedasticity (Section 7.5), tests for serial correlation (Section 7.7), tests 
of common factor restrictions (Section 7.9), DWH tests (Section 8.7), tests 
of overidentifying restrictions (Sections 8.6, 9.4, 9.5, 12.4, and 12.5), and the 
three classical tests for models estimated by maximum likelihood, notably LM 
tests (Section 10.6). 


In this chapter, we discuss a number of other procedures that are designed 
for testing the specification of econometric models. Some of these procedures 
explicitly involve testing a model against a less restricted alternative. Others 
do not make the alternative explicit and are intended to have power against a 
large number of plausible alternatives. In the next section, we discuss a variety 
of tests that are based on artificial regressions. Then, in Section 15.3, we 
discuss nonnested hypothesis tests, which are designed to test the specification 
of a model when alternative models are available. In Section 15.4, we discuss 
model selection based on information criteria. Finally, in Section 15.5, we 
introduce the concept of nonparametric estimation. Nonparametric methods 
avoid specification errors caused by imposing an incorrect functional form, 
and the validity of parametric models can be checked by comparing them 
with nonparametric ones.


15.2 Specification Tests Based on Artificial Regressions


In previous chapters, we have encountered numerous examples of artificial 
regressions. These include the Gauss-Newton regression (Section 6.7) and 
its heteroskedasticity-robust variant (Section 6.8), the OPG regression (Sec- 
tion 10.5), and the binary response model regression (Section 11.3). We can 
write any of these artificial regressions as 


$$r(\theta) = R(\theta) b + \text{residuals}, \tag{15.01}$$


where $\theta$ is a parameter vector of length $k$, $r(\theta)$ is a vector, often but by no
means always of length equal to the sample size $n$, and $R(\theta)$ is a matrix with
as many rows as $r(\theta)$ and $k$ columns. For example, in the case of the GNR,
$r(\theta)$ is a vector of residuals, written as a function of the data and parameters,
and $R(\theta)$ is a matrix of derivatives of the regression function with respect to
the parameters.


In order for (15.01) to be a valid artificial regression, the vector $r(\theta)$ and
the matrix $R(\theta)$ must satisfy certain properties, which all of the artificial
regressions we have studied do satisfy. These properties are given in outline
in Exercise 8.20, and we restate them more formally here. We use a notation
that was introduced in Section 9.5, whereby M denotes a model, $\mu$ denotes a
DGP which belongs to that model, and $\operatorname{plim}_\mu$ means a probability limit taken
under the DGP $\mu$. See the discussion in Section 9.5.


An artificial regression of the form (15.01) corresponds to a model M with
parameter vector $\theta$, and to a root-$n$ consistent asymptotically normal
estimator $\hat\theta$ of that parameter vector, if and only if the following three conditions are
satisfied. For the last two of these conditions, $\acute\theta$ may be any root-$n$ consistent
estimator, not necessarily the same as $\hat\theta$.


• The artificial regressand and the artificial regressors are orthogonal when
evaluated at $\hat\theta$, that is,

$$R^\top(\hat\theta)\, r(\hat\theta) = 0.$$


• Under any DGP $\mu \in M$, the asymptotic covariance matrix of $\hat\theta$ is given
either by

$$\operatorname{Var}\bigl(\operatorname{plim}_\mu\, n^{1/2}(\hat\theta - \theta_\mu)\bigr) = \operatorname{plim}_\mu \bigl(N^{-1} R^\top(\acute\theta) R(\acute\theta)\bigr)^{-1}, \tag{15.02}$$

where $\theta_\mu$ is the true parameter vector for the DGP $\mu$, $n$ is the sample
size, and $N$ is the number of rows of $r$ and $R$, or by

$$\operatorname{Var}\bigl(\operatorname{plim}_\mu\, n^{1/2}(\hat\theta - \theta_\mu)\bigr) = \operatorname{plim}_\mu\, \acute{s}^2 \bigl(N^{-1} R^\top(\acute\theta) R(\acute\theta)\bigr)^{-1}, \tag{15.03}$$

where $\acute{s}^2$ is the OLS estimate of the error variance obtained by running
regression (15.01) with $\theta = \acute\theta$.


• The artificial regression allows for one-step estimation, in the sense that,
if $\acute b$ denotes the vector of OLS parameter estimates obtained by running
regression (15.01) with $\theta = \acute\theta$, then, under any DGP $\mu \in M$,

$$\operatorname{plim}_\mu\, n^{1/2}(\acute\theta + \acute b - \theta_\mu) = \operatorname{plim}_\mu\, n^{1/2}(\hat\theta - \theta_\mu). \tag{15.04}$$

Equivalently, making use of the $O_p$ notation introduced in Section 14.2,
the property (15.04) may be expressed as $\acute\theta + \acute b = \hat\theta + O_p(n^{-1})$.
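

As a quick numerical illustration of the one-step property, consider the special case of a linear regression, for which the GNR evaluated at any starting value reproduces the OLS estimates exactly rather than only up to $O_p(n^{-1})$. The sketch below assumes NumPy; the simulated data and the names `theta_start`, `b`, and `one_step` are purely illustrative and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta0 = np.array([1.0, -0.5, 2.0])
y = X @ beta0 + rng.normal(size=n)

# OLS estimate, playing the role of theta-hat.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# An arbitrary starting value, playing the role of theta-acute.
theta_start = beta0 + 0.3

# GNR at theta_start: regressand r = y - X theta_start, regressors R = X.
r = y - X @ theta_start
b, *_ = np.linalg.lstsq(X, r, rcond=None)

# One-step estimator: starting value plus the artificial regression coefficients.
one_step = theta_start + b
print(np.max(np.abs(one_step - theta_hat)))
```

The printed discrepancy should be at the level of rounding error. For a genuinely nonlinear model the two sides would differ by a term of order $n^{-1}$ in probability, as the third condition states.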


The Gauss-Newton regression for a nonlinear regression model, together with 
the least-squares estimator of the parameters of the model, satisfies the above 
conditions. For the GNR, the asymptotic covariance matrix is given by equa- 
tion (15.03). The OPG regression for any model that can be estimated by 
maximum likelihood, together with the ML estimator of its parameters, also 
satisfies the above conditions, but the asymptotic covariance matrix is given 
by equation (15.02). See Davidson and MacKinnon (2001) for a more detailed 
discussion of artificial regressions. 


Now consider the artificial regression

$$r(\acute\theta) = R(\acute\theta) b + Z(\acute\theta) c + \text{residuals}, \tag{15.05}$$

where $\acute Z \equiv Z(\acute\theta)$ is a matrix with $r$ columns that depends on the same sample
data and parameter estimates as $\acute r \equiv r(\acute\theta)$ and $\acute R \equiv R(\acute\theta)$. We have previously
encountered instances of regressions like (15.05), where both $R(\theta)$ and $Z(\theta)$
were matrices of derivatives, with $R(\theta)$ corresponding to the parameters of
a restricted version of the model and $Z(\theta)$ corresponding to additional
parameters that appear only in the unrestricted model. In such a case, if the
root-$n$ consistent estimator $\acute\theta$ satisfies the restrictions, then running an
artificial regression like (15.05) and testing the hypothesis that $c = 0$ provides
a way of testing those restrictions; recall the discussion in Section 6.7 in the
context of the GNR. In many cases, $\acute\theta$ is conveniently chosen as the vector of
estimates from the restricted model.

A great many specification tests may be based on artificial regressions of the 
form (15.05). The null hypothesis under test is that the model M to which 
regression (15.01) corresponds is correctly specified. It is not necessary that 
the matrix Z should explicitly be a matrix of derivatives. In fact, any matrix 
$Z(\theta)$ which satisfies the following three conditions can be used in (15.05) to
obtain a valid specification test. 


R1. For every DGP $\mu \in M$,

$$\operatorname{plim}_\mu\, N^{-1} Z^\top(\theta_\mu)\, r(\theta_\mu) = 0. \tag{15.06}$$

A sufficient condition for (15.06) to hold is $E_\mu\bigl(Z_t^\top(\theta_\mu)\, r_t(\theta_\mu)\bigr) = 0$ for all
$t = 1, \ldots, N$, where $N$ is the number of elements of $r$, and $Z_t$ and $r_t$ are,
respectively, the $t$-th row and $t$-th element of $Z$ and $r$.


R2. Let $r_\mu$, $R_\mu$, and $Z_\mu$ denote $r(\theta_\mu)$, $R(\theta_\mu)$, and $Z(\theta_\mu)$, respectively. Then,
for any $\mu \in M$, if the asymptotic covariance matrix is given by (15.02),
the matrix

$$\operatorname{plim}_\mu\, \frac{1}{n} \begin{bmatrix} R_\mu^\top R_\mu & R_\mu^\top Z_\mu \\ Z_\mu^\top R_\mu & Z_\mu^\top Z_\mu \end{bmatrix} \tag{15.07}$$

is the covariance matrix of the plim of the vector $n^{-1/2}\bigl[R_\mu^\top r_\mu \;\vdots\; Z_\mu^\top r_\mu\bigr]$,
which is required to be asymptotically multivariate normal. If instead the
asymptotic covariance matrix is given by equation (15.03), then the
matrix (15.07) must be multiplied by the probability limit of the estimated
error variance from the artificial regression.


R3. The Jacobian matrix containing the partial derivatives of the elements
of the vector $n^{-1} Z^\top(\theta)\, r(\theta)$ with respect to the elements of $\theta$,
evaluated at $\theta_\mu$, is asymptotically equal, under the DGP $\mu$, to $-n^{-1} Z_\mu^\top R_\mu$.
Formally, this Jacobian matrix is equal to $-n^{-1} Z_\mu^\top R_\mu + O_p(n^{-1/2})$.


Since a proof of the sufficiency of these conditions requires a good deal of 
algebra, we relegate it to a technical appendix. 


When these conditions are satisfied, we can test the correct specification of 
the model M against an alternative in which equation (15.06) does not hold 
by testing the hypothesis that c = 0 in regression (15.05). If the asymptotic 
covariance matrix is given by equation (15.02), then the difference between the 
explained sum of squares from regression (15.05) and the ESS from regression 
(15.01), evaluated at $\acute\theta$, must be asymptotically distributed as $\chi^2(r)$ under
the null hypothesis. This is not true when the asymptotic covariance matrix
is given by equation (15.03), in which case we can use an asymptotic $t$ test if
$r = 1$ or an asymptotic $F$ test if $r > 1$.


The RESET Test 


One of the oldest specification tests for linear regression models, but one that 
is still widely used, is the regression specification error test, or RESET test, 
which was originally proposed by Ramsey (1969). The idea is to test the null 
hypothesis that 

$$y = X\beta + u, \qquad u \sim \mathrm{IID}(0, \sigma^2), \tag{15.08}$$


where the explanatory variables $X_t$ are predetermined with respect to the
error terms $u_t$, against the rather vaguely specified alternative that $E(y_t \mid X_t)$
is a nonlinear function of the elements of $X_t$. The simplest version of RESET
involves regressing $y_t$ on $X_t$ to obtain fitted values $X_t\hat\beta$ and then running the
regression

$$y_t = X_t\beta + \gamma (X_t\hat\beta)^2 + u_t. \tag{15.09}$$


The test statistic is the ordinary $t$ statistic for $\gamma = 0$.


At first glance, the RESET procedure may not seem to be based on an artificial
regression. But it is easy to show (Exercise 15.2) that the $t$ statistic for $\gamma = 0$
in regression (15.09) is identical to the $t$ statistic for $c = 0$ in the GNR

$$\hat u_t = X_t b + c (X_t\hat\beta)^2 + \text{residual}, \tag{15.10}$$

where $\hat u_t$ is the $t$-th residual from regression (15.08). The test regression (15.10)
is clearly a special case of the artificial regression (15.05), with $\hat\beta$ playing the
role of $\acute\theta$ and $(X_t\hat\beta)^2$ playing the role of $\acute Z$. It is not hard to check that the
three conditions for a valid specification test regression are satisfied. First,
the predeterminedness of $X_t$ implies that $E\bigl((X_t\beta_0)^2 (y_t - X_t\beta_0)\bigr) = 0$, where
$\beta_0$ is the true parameter vector, so that condition R1 holds. Condition R2
is equally easy to check. For condition R3, let $z(\beta)$ be the $n$-vector with
typical element $(X_t\beta)^2$. Then the derivative of $n^{-1} z^\top(\beta)(y - X\beta)$ with
respect to $\beta_i$, for $i = 1, \ldots, k$, evaluated at $\beta_0$, is

$$\frac{2}{n} \sum_{t=1}^{n} X_t\beta_0\, x_{ti}\, u_t - \frac{1}{n} \sum_{t=1}^{n} (X_t\beta_0)^2 x_{ti}.$$

The first term above is $n^{-1/2}$ times an expression which, by a central limit
theorem, is asymptotically normal with mean zero and finite variance. It is
therefore $O_p(n^{-1/2})$. The second term is an element of the vector $-n^{-1} z^\top(\beta_0) X$.
Thus condition R3 holds, and the RESET test, implemented by either
of the regressions (15.09) or (15.10), is seen to be asymptotically valid.
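

The equality of the two $t$ statistics is easy to check numerically. The following is a minimal sketch, assuming NumPy and simulated data; the helper `t_stat_last` and all variable names are illustrative and are not part of the text.

```python
import numpy as np

def t_stat_last(W, v):
    """OLS of v on W; return the t statistic on the last column of W."""
    n, p = W.shape
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    resid = v - W @ coef
    s2 = resid @ resid / (n - p)
    cov = s2 * np.linalg.inv(W.T @ W)
    return coef[-1] / np.sqrt(cov[-1, -1])

rng = np.random.default_rng(42)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# Estimate the null model (15.08) and form fitted values and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
u_hat = y - fitted

# Both test regressions share the regressor matrix [X, squared fitted values].
W = np.column_stack([X, fitted**2])
t_reset = t_stat_last(W, y)      # regression (15.09), regressand y
t_gnr = t_stat_last(W, u_hat)    # regression (15.10), regressand u_hat
print(t_reset, t_gnr)
```

The two printed values agree because $y$ and $\hat u$ differ only by $X\hat\beta$, which lies in the span of the regressors, so both regressions have the same residuals and the same coefficient on the squared fitted values.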


Actually, the RESET test is not merely valid asymptotically. It is exact in 
finite samples whenever the model that is being tested satisfies the strong 
assumptions needed for t statistics to have their namesake distribution; see 
Section 4.4 for a statement of those assumptions. To see why, note that the 
vector of fitted values $X\hat\beta$ is orthogonal to the residual vector $\hat u$, so that
$E(\hat\beta^\top X^\top \hat u) = 0$. Under the assumption of normal errors, it follows that $X\hat\beta$
is independent of $\hat u$. As Milliken and Graybill (1970) first showed, and as
readers are invited to show in Exercise 15.3, this implies that the $t$ statistic
for $c = 0$ yields an exact test under classical assumptions.


Like most specification tests, the RESET procedure is designed to have power 
against a variety of alternatives. However, it can also be derived as a test 
against a specific alternative. Suppose that 


$$y_t = \frac{\tau(\delta X_t\beta)}{\delta} + u_t, \tag{15.11}$$


where $\delta$ is a scalar parameter, and $\tau(x)$ may be any scalar function that is
monotonically increasing in its argument $x$ and satisfies the conditions

$$\tau(0) = 0, \qquad \tau'(0) = 1, \qquad \text{and} \qquad \tau''(0) \neq 0,$$

where $\tau'(0)$ and $\tau''(0)$ are the first and second derivatives of $\tau(x)$, evaluated
at $x = 0$. A simple example of such a function is

$$\tau(x) = x + x^2.$$

We first encountered the family of functions $\tau(\cdot)$ in Section 11.3, in connection
with tests of the functional form of binary response models.


By l'Hôpital's Rule, the nonlinear regression model (15.11) reduces to the
linear regression model (15.08) when $\delta = 0$. It is not hard to show, using
equations (11.29), that the GNR for testing the null hypothesis that $\delta = 0$ is

$$y_t - X_t\hat\beta = X_t b + c (X_t\hat\beta)^2\, \tfrac12 \tau''(0) + \text{residual},$$

which, since $\tau''(0)/2$ is just a constant, is equivalent to regression (15.10).
Thus RESET can be derived as a test of $\delta = 0$ in the nonlinear regression
model (15.11). For more details, see MacKinnon and Magee (1990), which 
also discusses some other specification tests that can be used to test (15.08) 
against nonlinear models involving transformations of the dependent variable. 
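

The regressor $\tfrac12\tau''(0)(X_t\hat\beta)^2$ in this GNR can be traced to a second-order Taylor expansion of $\tau$ around zero. The following sketch uses only the stated conditions $\tau(0) = 0$ and $\tau'(0) = 1$, assumes in addition that $\tau$ is twice continuously differentiable near zero, and writes $x$ as shorthand for $X_t\beta$.

```latex
\begin{align*}
\tau(\delta x) &= \delta x + \tfrac12 \tau''(0)\,\delta^2 x^2 + o(\delta^2), \\
\frac{\tau(\delta x)}{\delta} &= x + \tfrac12 \tau''(0)\,\delta x^2 + o(\delta), \\
\left.\frac{\partial}{\partial \delta}\,\frac{\tau(\delta x)}{\delta}\right|_{\delta = 0}
  &= \tfrac12 \tau''(0)\,x^2 .
\end{align*}
```

The second line shows that the regression function tends to $x = X_t\beta$ as $\delta \to 0$, and the third gives the derivative with respect to $\delta$ that enters the GNR as the test regressor.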


Some versions of the RESET procedure add the cube, and sometimes also the 
fourth power, of $X_t\hat\beta$ to the test regression (15.09). This makes no sense if
the alternative is (15.11), but it may give the test more power against some
other alternatives. In general, however, we recommend the simplest version
of the test, namely, the $t$ test for $\gamma = 0$ in regression (15.09).


Conditional Moment Tests 


If a model M is correctly specified, many random quantities that are functions 
of the dependent variable(s) should have expectations of zero. Often, these 
expectations are taken conditional on some information set. For example, 
in the linear regression model (15.08), the expectation of the error term $u_t$,
conditional on any variable in the information set $\Omega_t$ relative to which the
model is supposed to give the conditional mean of $y_t$, should be equal to
zero. For any $z_t$ that belongs to $\Omega_t$, therefore, we have $E(z_t u_t) = 0$ for all
observations t. This sort of requirement, following from the hypothesis that 
M is correctly specified, is known as a moment condition. 


A moment condition is purely theoretical. However, we can often calculate 
the empirical counterpart of a moment condition and use it as the basis of a 
conditional moment test. For a linear regression, of course, we already know 
how to perform such a test: We add $z_t$ to regression (15.08) and use the
$t$ statistic to test the hypothesis that the coefficient on this additional regressor is 0.


More generally, consider a moment condition of the form 
$$E_\theta\bigl(m_t(y_t, \theta)\bigr) = 0, \qquad t = 1, \ldots, n, \tag{15.12}$$


where $y_t$ is the dependent variable, and $\theta$ is the vector of parameters for the
model M. As the notation implies, the expectation in (15.12) is computed
using a DGP in M with parameter vector $\theta$. The $t$ subscript on the moment
function $m_t(y_t, \theta)$ indicates that, in general, moment functions also depend
on exogenous or predetermined variables. Equation (15.12) implies that the
$m_t(y_t, \theta)$ are elementary zero functions in the sense of Section 9.5. We cannot
test whether condition (15.12) holds for each observation, but we can test
whether it holds on average. Since we will be interested in asymptotic tests,
it is natural to consider the probability limit of the average. Thus we can
replace (15.12) by the somewhat weaker condition

$$\operatorname*{plim}_{n\to\infty}\, \frac{1}{n} \sum_{t=1}^{n} m_t(y_t, \theta) = 0. \tag{15.13}$$

The empirical counterpart of the left-hand side of condition (15.13) is

$$m(y, \hat\theta) \equiv \frac{1}{n} \sum_{t=1}^{n} m_t(y_t, \hat\theta), \tag{15.14}$$

where $y$ denotes the vector with typical element $y_t$, and $\hat\theta$ denotes a vector
of estimates of $\theta$ from the model under test. The quantity $m(y, \hat\theta)$ is referred
to as an empirical moment. We wish to test whether its value is significantly
different from zero.


In order to do so, we need an estimate of the variance of $m(y, \hat\theta)$. It might seem
that, since the empirical moment is just the sample mean of the $m_t(y_t, \hat\theta)$, this
variance could be consistently estimated by the usual sample variance,

$$\frac{1}{n - 1} \sum_{t=1}^{n} \bigl(m_t(y_t, \hat\theta) - m(y, \hat\theta)\bigr)^2. \tag{15.15}$$

If $\hat\theta$ were replaced by the true value $\theta_0$ in expression (15.14), then we could
indeed use the sample variance (15.15) with $\hat\theta$ replaced by $\theta_0$ to estimate the
variance of the empirical moment. But, because the vector $\hat\theta$ is random, on
account of its dependence on $y$, we have to take this parameter uncertainty
into account when we estimate the variance of $m(y, \hat\theta)$.


The easiest way to see the effects of parameter uncertainty is to consider 
conditional moment tests based on artificial regressions. Suppose there is an 
artificial regression of the form (15.01) in correspondence with the model M 
and the estimator $\hat\theta$ which allows us to write the moment function $m_t(y_t, \theta)$
as the product $z_t(y_t, \theta)\, r_t(y_t, \theta)$ of a factor $z_t$ and the regressand $r_t$ of the
artificial regression. If the number $N$ of artificial observations is not equal
to the sample size $n$, some algebraic manipulation may be needed in order
to express the moment functions in a convenient form, but we ignore such
problems here and suppose that $N = n$.


Now consider the artificial regression of which the typical observation is 

$$r_t(y_t, \theta) = R_t(y_t, \theta) b + c z_t(y_t, \theta) + \text{residual}. \tag{15.16}$$

If the $z_t$ satisfy conditions R1-R3, then the $t$ statistic for $c = 0$ is a valid
test statistic whenever equation (15.16) is evaluated at a root-$n$ consistent
estimate of $\theta$, in particular, at $\hat\theta$. By applying the FWL Theorem to this
equation and taking probability limits, it is not difficult to see that this
$t$ statistic is actually testing the hypothesis that

$$\operatorname*{plim}_{n\to\infty}\, \frac{1}{n}\, z_0^\top M_{R_0} r_0 = 0, \tag{15.17}$$

where $z_0 \equiv z(\theta_0)$, $R_0 \equiv R(\theta_0)$, and $M_{R_0}$ is the matrix that projects
orthogonally on to $\mathcal{S}^\perp(R_0)$. Asymptotically, equation (15.17) is precisely the moment
condition that we wish to test, as can be seen from the following argument:

$$\begin{aligned}
n^{1/2} m(\hat\theta) &= n^{-1/2} z^\top(\hat\theta)\, r(\hat\theta) \\
&= n^{-1/2} z_0^\top r_0 - n^{-1} z_0^\top R_0\, n^{1/2}(\hat\theta - \theta_0) + O_p(n^{-1/2}) \\
&= n^{-1/2} z_0^\top M_{R_0} r_0 + O_p(n^{-1/2}),
\end{aligned} \tag{15.18}$$

where for notational ease we have suppressed the dependence on the dependent
variable. The steps leading to (15.18) are very similar to the derivation of
a closely related result in the technical appendix, and interested readers are
urged to consult the latter. If there were no parameter uncertainty, the second
term in the second line above would vanish, and the leading-order term in
expression (15.18) would simply be $n^{-1/2} z_0^\top r_0$.


It is clear from expression (15.18) that, as we indicated above, the
asymptotic variance of $n^{1/2} m(\hat\theta)$ is smaller than that of $n^{1/2} m(\theta_0)$, because the
projection $M_{R_0}$ appears in the leading-order term for the former empirical
moment but not in the leading-order term for the latter one. The reduction 
in variance caused by the projection is a phenomenon analogous to the loss 
of degrees of freedom in Hansen-Sargan tests caused by the need to estimate 
parameters; recall the discussion in Section 9.4. Indeed, since moment func- 
tions are zero functions, conditional moment tests can be interpreted as tests 
of overidentifying restrictions. 
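

A small simulation makes the variance reduction visible. The sketch below assumes NumPy and uses a linear regression with unit error variance, for which the two finite-sample variances are $z^\top z/n$ and $z^\top M_X z/n$; the design, the test variable `z`, and the number of replications are illustrative choices rather than anything taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
z = x**2 + 0.5 * x               # test variable, correlated with the regressors
beta0 = np.array([1.0, 2.0])

# Matrix that projects off the column space of X.
M_X = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

m_true = np.empty(reps)
m_hat = np.empty(reps)
for i in range(reps):
    u = rng.normal(size=n)
    y = X @ beta0 + u
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Scaled empirical moments at the true and at the estimated parameters.
    m_true[i] = np.sqrt(n) * np.mean(z * (y - X @ beta0))
    m_hat[i] = np.sqrt(n) * np.mean(z * (y - X @ beta_hat))

var_true, var_hat = m_true.var(), m_hat.var()
print(var_true, z @ z / n)           # simulated vs. theoretical, beta_0 known
print(var_hat, z @ M_X @ z / n)      # simulated vs. theoretical, beta estimated
```

In each printed pair the simulated variance should be close to its theoretical counterpart, and the variance with estimated parameters should be the smaller of the two.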


Examples of Conditional Moment Tests 


Suppose the model under test is the nonlinear regression model (6.01), and 
the moment functions can be written as 


$$m_t(\beta) = z_t(\beta)\, u_t(\beta), \tag{15.19}$$


where $u_t(\beta) \equiv y_t - x_t(\beta)$ is the $t$-th residual, and $z_t(\beta)$ is some function
of exogenous or predetermined variables and the parameters. We are using
$\beta$ instead of $\theta$ to denote the vector of parameters here because the
regression function is $x_t(\beta)$. In this case, as we now show, a test of the moment
condition (15.13) can be based on the following Gauss-Newton regression:

$$u_t(\hat\beta) = X_t(\hat\beta) b + c z_t(\hat\beta) + \text{residual}, \tag{15.20}$$

where $\hat\beta$ is the vector of NLS estimates of the parameters, and $X_t(\beta)$ is the
$k$-vector of derivatives of $x_t(\beta)$ with respect to the elements of $\beta$.


Since the NLS estimator $\hat\beta$ is root-$n$ consistent and asymptotically normal
under the usual regularity conditions for nonlinear regression, all we have
to show is that conditions R1-R3 are satisfied by the GNR (15.20).
Condition R1 is trivially satisfied, since what it requires is precisely what we wish
to test. Condition R2, for the covariance matrix (15.03), follows easily from
the fact that $X_t(\beta)$ and $z_t(\beta)$ depend on the data only through exogenous or
predetermined variables.


Condition R3 requires a little more work, however. Let $z(\beta)$ and $u(\beta)$ be the
$n$-vectors with typical elements $z_t(\beta)$ and $u_t(\beta)$, respectively. The derivative
of $n^{-1} z^\top(\beta)\, u(\beta)$ with respect to any component $\beta_i$ of the vector $\beta$ is

$$\frac{1}{n}\, \frac{\partial z^\top(\beta)}{\partial \beta_i}\, u(\beta) - \frac{1}{n}\, z^\top(\beta)\, \frac{\partial x(\beta)}{\partial \beta_i}. \tag{15.21}$$

Since the elements of $z(\beta)$ are predetermined, so are those of its derivative
with respect to $\beta_i$, and since $u(\beta_0)$ is just the vector of error terms, it follows
from a law of large numbers that the first term of expression (15.21) tends to
zero as $n \to \infty$. In fact, by a central limit theorem, this term is $O_p(n^{-1/2})$.
The $n \times k$ matrix $X(\beta)$ has typical column $\partial x(\beta)/\partial \beta_i$. Therefore, the
Jacobian matrix of $n^{-1} z^\top(\beta)\, u(\beta)$ is asymptotically equal to $-n^{-1} z^\top(\beta_0) X(\beta_0)$,
which is condition R3 for the GNR (15.20). Thus we conclude that this GNR
can be used to test the moment condition (15.13).


The above reasoning can easily be generalized to allow us to test more than 
one moment condition at a time. Let $Z(\beta)$ denote an $n \times r$ matrix of
functions of the data, each column of which is asymptotically orthogonal to the
vector $u$ under the null hypothesis that is to be tested, in the sense that
$\operatorname{plim} n^{-1} Z^\top(\beta_0)\, u = 0$. Now consider the artificial regression

$$u(\hat\beta) = X(\hat\beta) b + Z(\hat\beta) c + \text{residuals}. \tag{15.22}$$

As readers are asked to show in Exercise 15.5, $n$ times the uncentered $R^2$ from
this regression is asymptotically distributed as $\chi^2(r)$ under the null
hypothesis. An ordinary $F$ test for $c = 0$ is also asymptotically valid.
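

For concreteness, here is a minimal sketch of the $nR^2$ statistic from regression (15.22) in the special case of a linear regression function, so that $X(\hat\beta)$ is simply the regressor matrix. It assumes NumPy; the function name `cm_test_gnr`, the simulated data, and the choice of test columns are illustrative only.

```python
import numpy as np

def cm_test_gnr(u_hat, X_hat, Z_hat):
    """n times the uncentered R^2 from regressing u_hat on [X_hat, Z_hat]."""
    n = u_hat.shape[0]
    W = np.column_stack([X_hat, Z_hat])
    coef, *_ = np.linalg.lstsq(W, u_hat, rcond=None)
    fitted = W @ coef
    r2_uncentered = (fitted @ fitted) / (u_hat @ u_hat)
    return n * r2_uncentered

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)   # the null is true here

# Residuals from the model under test.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat

# r = 2 columns of test variables, functions of the regressor only.
Z = np.column_stack([x**2, x**3])

stat = cm_test_gnr(u_hat, X, Z)
print(stat)   # compare with a chi-squared(2) critical value, about 5.99 at the 5% level
```

Because the residuals are orthogonal to `X` by construction, all of the explanatory power in this regression comes from the part of `Z` that is orthogonal to `X`.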


Conditional moment tests based on the GNR are often useful for linear and 
nonlinear regression models, but they evidently cannot be used when the GNR 
itself is not applicable. With models estimated by maximum likelihood, tests 
can be based on the OPG regression that was introduced in Section 10.5. This 
artificial regression applies whenever there is a Type 2 MLE $\hat\theta$ that is root-$n$
consistent and asymptotically normal; see Section 10.3. 


The OPG regression was originally given in equation (10.72). It is repeated 
here for convenience with a minor change of notation:

$$\iota = G(\theta) b + \text{residuals}. \tag{15.23}$$

The regressand is an $n$-vector of 1s, and the regressor matrix is the matrix
of contributions to the gradient, with typical element defined by (10.26). The
artificial regression corresponds to the model implicitly defined by the matrix
$G(\theta)$, together with the ML estimator $\hat\theta$. Let $m(\theta)$ be the $n$-vector with
typical element the moment function $m_t(y_t, \theta)$ that is to be tested, where
once more the notation hides the dependence on the data. Then the testing
regression is simplicity itself: We add $m(\theta)$ to regression (15.23) as an extra
regressor, obtaining

$$\iota = G(\theta) b + c\, m(\theta) + \text{residuals}. \tag{15.24}$$

The test statistic is the $t$ statistic on the extra regressor. The regressors here
can be evaluated at any root-$n$ consistent estimator, but it is most common
to use the MLE $\hat\theta$.


If several moment conditions are to be tested simultaneously, then we can 
form the $n \times r$ matrix $M(\theta)$, each column of which is a vector of moment
functions. The testing regression is then

$$\iota = G(\theta) b + M(\theta) c + \text{residuals}. \tag{15.25}$$

When the regressors are evaluated at the MLE $\hat\theta$, several asymptotically valid
test statistics are available, including the explained sum of squares, $n$ times
the uncentered $R^2$, and the $F$ statistic for the artificial hypothesis that $c = 0$.
The first two of these statistics are distributed asymptotically as $\chi^2(r)$ under
the null hypothesis, as is $r$ times the third. If the regressors in equation (15.25)
are not evaluated at $\hat\theta$, but at some other root-$n$ consistent estimate, then only
the F statistic is asymptotically valid. 
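

As an illustration of regression (15.25), the sketch below tests two moment conditions in a deliberately simple model, $y_t \sim N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, using the third and fourth moments of the standardized errors as moment functions. It assumes NumPy; the model, the choice of moments, and the function name `opg_cm_test` are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def opg_cm_test(y):
    """Explained sum of squares from the OPG testing regression (15.25)
    for a normal model, testing third- and fourth-moment conditions."""
    n = y.shape[0]
    mu = y.mean()
    sigma2 = np.mean((y - mu) ** 2)          # ML estimate (divides by n)
    e = (y - mu) / np.sqrt(sigma2)
    # Contributions to the gradient with respect to (mu, sigma^2).
    G = np.column_stack([
        (y - mu) / sigma2,
        -0.5 / sigma2 + (y - mu) ** 2 / (2.0 * sigma2**2),
    ])
    # Moment functions: third moment, and fourth moment minus 3.
    M = np.column_stack([e**3, e**4 - 3.0])
    W = np.column_stack([G, M])
    iota = np.ones(n)
    coef, *_ = np.linalg.lstsq(W, iota, rcond=None)
    resid = iota - W @ coef
    # The regressand has total sum of squares n, so ESS = n - SSR = n R^2.
    ess = n - resid @ resid
    return ess, G

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=2.0, size=400)
stat, G = opg_cm_test(y)
print(stat)   # asymptotically chi-squared(2) under the null
```

Since the regressand is a vector of 1s, the explained sum of squares and $n$ times the uncentered $R^2$ coincide, which is why the function returns a single number. At the MLE the columns of `G` sum to zero, so the regression of $\iota$ on `G` alone has no explanatory power.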


The artificial regression (15.23) is valid for a very wide variety of models. 
Condition R2 requires that we be able to apply a central limit theorem to 
the scalar product $n^{-1/2} m^\top(\theta_0)\iota$, where, as usual, $\theta_0$ is the true parameter
vector. If the expectation of each moment function $m_t(\theta_0)$ is zero conditional
on an appropriate information set $\Omega_t$, then it is normally a routine matter
to find a suitable central limit theorem. Condition R3 is also satisfied under
very mild regularity conditions. What it requires is that the derivatives of
$n^{-1} m^\top(\theta)\iota$ with respect to the elements of $\theta$, evaluated at $\theta_0$, should be given
by the elements of the vector $-n^{-1} m^\top(\theta_0) G(\theta_0)$, up to a term of order $n^{-1/2}$.
Formally, we require that

$$\frac{1}{n} \sum_{t=1}^{n} \left.\frac{\partial m_t(\theta)}{\partial \theta}\right|_{\theta = \theta_0} = -\frac{1}{n} \sum_{t=1}^{n} m_t(\theta_0)\, G_t(\theta_0) + O_p(n^{-1/2}), \tag{15.26}$$

where $G_t(\theta)$ is the $t$-th row of $G(\theta)$. Readers are invited in Exercise 15.6
to show that equation (15.26) holds under the usual regularity conditions
for ML estimation. This property and its use in conditional moment tests
implemented by an OPG regression were first established by Newey (1985).
It is straightforward to extend this result to the case in which we have a
matrix $M(\theta)$ of moment functions.


As we noted in Section 10.5, many tests based on the OPG regression are prone 
to overreject the null hypothesis, sometimes very severely, in finite samples. 
It is therefore often a good idea to bootstrap conditional moment tests based 
on the OPG regression. Since the model under test is estimated by maximum 
likelihood, a fully parametric bootstrap is appropriate. It is generally quite 
easy to implement such a bootstrap, unless estimating the original model is 
unusually difficult or expensive. 


Tests for Skewness and Kurtosis 


One common application of conditional moment tests is checking the residuals 
from an econometric model for skewness and excess kurtosis. By “excess” 
kurtosis, we mean a fourth moment greater than $3\sigma^4$, the value for the normal
distribution; see Exercise 4.2. The presence of significant departures from 
normality may indicate that a model is misspecified, or it may indicate that 
we should use a different estimation method. For example, although least 
squares may still perform well in the presence of moderate skewness and excess 
kurtosis, it cannot be expected to do so when the error terms are extremely 
skewed or have very thick tails. 


Both skewness and excess kurtosis are often encountered in returns data from 
financial markets, especially when the returns are measured over short periods 
of time. A good model should eliminate, or at least substantially reduce, 
the skewness and excess kurtosis that are generally evident in daily, weekly,
and, to a lesser extent, monthly returns data. Thus one way to evaluate a 
model for financial returns, such as the ARCH models that were discussed in 
Section 13.6, is to test the residuals for skewness and excess kurtosis.


We cannot base tests for skewness and excess kurtosis in regression models 
on the GNR, because the GNR is designed only for testing against alterna- 
tives that involve the conditional mean of the dependent variable. There is 
no way to define functions $z_t(\cdot)$ that depend on parameters and exogenous
or predetermined variables in such a way that the moment function (15.19) 
corresponds to the condition we wish to test. Instead, one valid approach 
is to test the slightly stronger assumption that the error terms are normally 
distributed by using the OPG regression. We now discuss this approach and 
show that even simpler tests are available. 


The OPG regression that corresponds to the linear regression model

$$y = X\beta + u, \quad u \sim N(0, \sigma^2 I),$$

where the regressors include a constant or the equivalent, can be written as

$$1 = \frac{1}{\sigma^2}\,u_t(\beta)X_t b + b_\sigma\,\frac{u_t^2(\beta) - \sigma^2}{\sigma^3} + \text{residual}. \tag{15.27}$$

Here $u_t(\beta) \equiv y_t - X_t\beta$, and the assumption that the error terms are normal
implies that they are not skewed and do not suffer from excess kurtosis. To
test the assumption that they are not skewed, the appropriate test regressor
for observation t is just $u_t^3(\beta)$. For testing purposes, all the regressors are to
be evaluated at the OLS estimates $\hat\beta$ and $\hat\sigma^2 \equiv \text{SSR}/n$. Thus an appropriate
testing regression is

$$1 = \frac{1}{\hat\sigma^2}\,u_t(\hat\beta)X_t b + b_\sigma\,\frac{u_t^2(\hat\beta) - \hat\sigma^2}{\hat\sigma^3} + c\,u_t^3(\hat\beta) + \text{residual}. \tag{15.28}$$

This is just a special case of regression (15.24), and the test statistic is simply
the t statistic for $c = 0$.


Regression (15.28) is unnecessarily complicated. First, observe that the test
regressor is asymptotically orthogonal under the null to the regressor that
corresponds to the parameter $\sigma$. To see this, evaluate the regressors at the
true $\beta_0$ instead of at $\hat\beta$. Then the residuals $u_t(\beta_0)$ are just the error terms $u_t$,
and so we see that

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^n \frac{u_t^2 - \sigma^2}{\sigma^3}\,u_t^3 = 0.$$

This result uses a law of large numbers and follows from the facts that
$E(u_t^5) = E(u_t^3) = 0$ if $u_t$ is normally distributed. Thus the t statistic for $c = 0$ from
regression (15.28) is asymptotically unchanged if we simply omit the regressor
corresponding to $\sigma$.


This t statistic is also unchanged, in finite samples, if we add to $u_t^3(\hat\beta)$ any
linear combination of the regressors that correspond to $\beta$; recall the discussion
in Section 2.4 in connection with the FWL Theorem. Thus, since we assumed
that there is a constant term in the regression, the t statistic is unchanged if
we replace $u_t^3(\hat\beta)$ by $u_t^3(\hat\beta) - 3\hat\sigma^2 u_t(\hat\beta)$. Doing so makes the new test regressor
asymptotically orthogonal to all the regressors that correspond to $\beta$, as can
be seen from the following calculation:

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^n X_t u_t\,(u_t^3 - 3\sigma^2 u_t) = \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^n X_t\,E(u_t^4 - 3\sigma^2 u_t^2) = 0.$$

The second equality uses the fact that, when $u_t$ is normal, $E(u_t^4) = 3\sigma^4$.
Therefore, it makes no difference asymptotically if we omit the regressors
that correspond to $\beta$.


The above arguments imply that we can obtain a valid test simply by using
the t statistic from the regression

$$1 = c\,\bigl(u_t^3(\hat\beta) - 3\hat\sigma^2 u_t(\hat\beta)\bigr) + \text{residual}, \tag{15.29}$$

which is numerically identical to the t statistic for the sample mean of the
single regressor here to be 0. Because the plim of the error variance is just 1,
since the regressor and regressand are asymptotically orthogonal, both of these
t statistics are asymptotically equal to

$$\frac{n^{-1/2}\sum_{t=1}^n (\hat u_t^3 - 3\hat\sigma^2\hat u_t)}{\bigl(n^{-1}\sum_{t=1}^n (\hat u_t^3 - 3\hat\sigma^2\hat u_t)^2\bigr)^{1/2}}, \tag{15.30}$$

where $\hat u_t \equiv u_t(\hat\beta)$. Since the OLS residuals from a regression that includes a
constant sum to 0, the numerator of this expression simplifies to $n^{-1/2}\sum_{t=1}^n \hat u_t^3$.
The sixth moment of the normal distribution is $15\sigma^6$ (see Exercise 13.19), and
so the plim of the denominator is the square root of

$$E(u_t^6 - 6\sigma^2 u_t^4 + 9\sigma^4 u_t^2) = \sigma^6(15 - 18 + 9) = 6\sigma^6.$$

It follows that expression (15.30) is asymptotically equal to the much simpler
test statistic

$$\tau_3 \equiv (6n)^{-1/2}\sum_{t=1}^n e_t^3, \tag{15.31}$$

which is expressed in terms of the normalized residuals $e_t \equiv \hat u_t/\hat\sigma$. The
asymptotic distribution of the test statistic $\tau_3$ is standard normal under the null
hypothesis that the error terms are normally distributed.
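The numerical identity claimed for regression (15.29) can be checked directly. The sketch below, which uses hypothetical simulated data and helper names of our own choosing, computes the t statistic from regressing 1 on the single regressor, the ordinary t statistic for that regressor to have mean zero, expression (15.30), and the statistic (15.31).

```python
import numpy as np

def t_stat_ones_on_z(z):
    """t statistic for c = 0 in the regression 1 = c z_t + residual."""
    n = z.shape[0]
    c_hat = z.sum() / (z @ z)
    ssr = n - z.sum() ** 2 / (z @ z)     # SSR of the regression of ones on z
    s2 = ssr / (n - 1)
    return c_hat / np.sqrt(s2 / (z @ z))

def t_stat_mean_zero(z):
    """Ordinary t statistic for the hypothesis that the mean of z is zero."""
    n = z.shape[0]
    return z.mean() / (z.std(ddof=1) / np.sqrt(n))

# Hypothetical data from a linear regression with a constant.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma_hat = np.sqrt(u_hat @ u_hat / n)

z = u_hat**3 - 3.0 * sigma_hat**2 * u_hat                    # regressor in (15.29)
t_reg = t_stat_ones_on_z(z)
t_mean = t_stat_mean_zero(z)
expr_15_30 = z.sum() / np.sqrt(n) / np.sqrt(np.mean(z**2))   # expression (15.30)
tau3 = np.sum((u_hat / sigma_hat) ** 3) / np.sqrt(6.0 * n)   # statistic (15.31)
```

The first two quantities agree up to rounding error, whereas `expr_15_30` and `tau3` are only asymptotically equal to them and will generally differ in any given sample.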


It follows from (15.31) that the variance of $n^{-1/2}\sum_{t=1}^n \hat u_t^3$ is $6\sigma^6$. In contrast,
the variance of $n^{-1/2}\sum_{t=1}^n u_t^3$, which is equal to the variance of $u_t^3$, is $15\sigma^6$; see
Exercise 13.19. Thus, in this case, the reduction in variance due to parameter
uncertainty is very considerable.


The t-th observation of the regressor needed to test for excess kurtosis is
$u_t^4 - 3\sigma^4$. It is easy to check that this regressor can be made asymptotically
orthogonal to the other regressors in (15.27) without changing the t statistic
by adding $-6\sigma^5$ times the regressor corresponding to $\sigma$, so as to yield
$u_t^4 - 6\sigma^2 u_t^2 + 3\sigma^4$. Dividing this by $\sigma^4$ also has no effect on the t statistic, and
so running the test regression

$$1 = c\,(e_t^4 - 6e_t^2 + 3) + \text{residual}, \tag{15.32}$$

which is defined in terms of the normalized residuals, provides an appropriate
test statistic. As readers are invited to check in Exercise 15.8, this statistic is
asymptotically equivalent to the simpler statistic

$$\tau_4 \equiv (24n)^{-1/2}\sum_{t=1}^n (e_t^4 - 3). \tag{15.33}$$

It is important that the denominator of the normalized residual be the ML
estimator $\hat\sigma$ rather than the usual least squares estimator $s$, as this choice
ensures that the sum of the $e_t^2$ is precisely $n$. Like $\tau_3$, the statistic $\tau_4$ has an
asymptotic $N(0,1)$ distribution under the null hypothesis of normality.


The squares of $\tau_3$ and $\tau_4$ are widely used as test statistics for skewness and
excess kurtosis; they are both asymptotically distributed as $\chi^2(1)$. However,
we prefer to use the statistics themselves rather than their squares, since the
sign is informative. The statistic $\tau_3$ is positive when the residuals are skewed
to the right and negative when they are skewed to the left. Similarly, the
statistic $\tau_4$ is positive if there is positive excess kurtosis and negative if there
is negative excess kurtosis.


It can be shown (see Exercise 15.8 again) that the test statistics (15.31) and
(15.33) are asymptotically independent under the null. Therefore, a joint test
for skewness and excess kurtosis can be based on the statistic

$$\tau_{3,4} \equiv \tau_3^2 + \tau_4^2, \tag{15.34}$$

which is asymptotically distributed as $\chi^2(2)$ when the error terms are normally
distributed. The statistics $\tau_3$, $\tau_4$, and $\tau_{3,4}$ were proposed, in slightly different
forms, by Jarque and Bera (1980) and Kiefer and Salmon (1983); see also
Bera and Jarque (1982). Many regression packages calculate these statistics
as a matter of course.
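For readers who wish to compute these statistics themselves, a minimal Python sketch follows. The function name `normality_statistics` is our own, NumPy is assumed, and the asymptotic P value relies on the fact that the survival function of the $\chi^2(2)$ distribution is $\exp(-x/2)$.

```python
import math
import numpy as np

def normality_statistics(y, X):
    """Return (tau3, tau4, tau34, p_value) as in (15.31), (15.33), (15.34).

    X must contain a constant or the equivalent. The P value is asymptotic.
    """
    n = y.shape[0]
    u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma_hat = np.sqrt(u_hat @ u_hat / n)   # ML estimate, so sum(e**2) equals n
    e = u_hat / sigma_hat                    # normalized residuals
    tau3 = np.sum(e**3) / math.sqrt(6.0 * n)
    tau4 = np.sum(e**4 - 3.0) / math.sqrt(24.0 * n)
    tau34 = tau3**2 + tau4**2
    return tau3, tau4, tau34, math.exp(-tau34 / 2.0)

# Tiny deterministic example: residuals equal to +1 and -1 are perfectly
# symmetric and have thinner tails than the normal distribution.
y = np.array([1.0, -1.0] * 12)
X = np.ones((24, 1))
tau3, tau4, tau34, p = normality_statistics(y, X)
```

In this artificial example the normalized residuals are all plus or minus one, so the skewness statistic is zero and the kurtosis statistic is negative, in line with the sign interpretation given above.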


The statistics $\tau_3$, $\tau_4$, and $\tau_{3,4}$ defined in equations (15.31), (15.33), and (15.34)
depend solely on normalized residuals. This implies that, for a linear regres- 
sion model with fixed regressors, they are pivotal under the null hypothesis of 
normality. Therefore, if we use the parametric bootstrap in this situation, we 
can obtain exact tests based on these statistics; see the discussion at the end 
of Section 7.7. Even for nonlinear regression models or models with lagged 
dependent variables, parametric bootstrap tests should work very much better 
than asymptotic tests. 
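The following sketch illustrates why pivotalness makes the parametric bootstrap so convenient here. Since the statistic depends only on normalized residuals, bootstrap samples can be generated from standard normal errors without reference to the parameter estimates. The function names, the number of bootstrap samples, and the simulated data are hypothetical choices made for illustration.

```python
import math
import numpy as np

def tau34(y, X):
    """Joint skewness and kurtosis statistic (15.34) from normalized OLS residuals."""
    n = y.shape[0]
    u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    e = u_hat / np.sqrt(u_hat @ u_hat / n)
    tau3 = np.sum(e**3) / math.sqrt(6.0 * n)
    tau4 = np.sum(e**4 - 3.0) / math.sqrt(24.0 * n)
    return tau3**2 + tau4**2

def bootstrap_p_value(y, X, B=999, seed=0):
    """Parametric bootstrap P value for tau34 with fixed regressors.

    The statistic is pivotal under normality, so the bootstrap samples are
    simply vectors of standard normal errors: beta and sigma do not matter.
    """
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    stat = tau34(y, X)
    stats_star = np.array([tau34(rng.normal(size=n), X) for _ in range(B)])
    return stat, float(np.mean(stats_star >= stat))

# Hypothetical data with thick-tailed errors (Student's t with 3 df).
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.standard_t(df=3, size=n)
y = X @ np.array([1.0, 1.0]) + u
stat, p_boot = bootstrap_p_value(y, X)
```

Changing the coefficient vector or rescaling the errors leaves `tau34` unchanged apart from rounding error, which is the sense in which the statistic is pivotal in this setting.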


The statistics $\tau_3$, $\tau_4$, and $\tau_{3,4}$ are not valid if the regression model that
furnishes the normalized residuals does not contain a constant or the equivalent.
In such unusual cases, it is necessary to proceed differently, for instance, by 
using the full OPG regression (15.27) with one or two test regressors. The 
OPG regression can also be used to test for skewness and excess kurtosis in 
models that are not regression models, such as the models with ARCH errors 
that were discussed in Section 13.6. 


Information Matrix Tests 


In Section 10.3, we first encountered the information matrix equality. This
famous result, which is given in equation (10.34), tells us that, for a model
estimated by maximum likelihood with parameter vector $\theta$, the asymptotic
information matrix, $\mathcal{I}(\theta)$, is equal to minus the asymptotic Hessian, $\mathcal{H}(\theta)$.
The proof of this result, which was given in Exercises 10.6 and 10.7, depends
on the DGP being a special case of the model. Therefore, we should expect
that, in general, the information matrix equality does not hold when the model
we are estimating is misspecified. This suggests that testing this equality is
one way to test the specification of a statistical model. This idea was first
suggested by White (1982), who called tests based on it information matrix
tests, or IM tests. These tests were later reinterpreted as conditional moment
tests by Newey (1985) and White (1987).


Consider a statistical model characterized by the loglikelihood function

$$\ell(\theta) = \sum_{t=1}^n \ell_t(y^t, \theta),$$

in the standard notation of equation (10.25). The null hypothesis for the IM
test is that

$$\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{t=1}^n \left(\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j} + \frac{\partial\ell_t(\theta)}{\partial\theta_i}\,\frac{\partial\ell_t(\theta)}{\partial\theta_j}\right) = 0, \tag{15.35}$$

for $i = 1,\ldots,k$ and $j = 1,\ldots,i$. Expression (15.35) is a typical element of the
information matrix equality. The first term is an element of the asymptotic
Hessian, and the second term is the corresponding element of the outer product
of the gradient, the expectation of which is the asymptotic information
matrix. Since both these matrices are symmetric, there are $\frac{1}{2}k(k+1)$ distinct
conditions of the form (15.35).

Equation (15.35) is a conditional moment in the form (15.13). We can therefore
calculate IM test statistics by means of the OPG regression, a procedure
that was originally suggested by Chesher (1983) and Lancaster (1984).
The matrix $M(\theta)$ that appears in regression (15.25) is constructed as an
$n \times \frac{1}{2}k(k+1)$ matrix with typical element

$$\frac{\partial^2\ell_t(\theta)}{\partial\theta_i\,\partial\theta_j} + \frac{\partial\ell_t(\theta)}{\partial\theta_i}\,\frac{\partial\ell_t(\theta)}{\partial\theta_j}. \tag{15.36}$$

This matrix and the other matrix of regressors $G(\theta)$ in (15.25) are usually
evaluated at the ML estimates $\hat\theta$. The test statistic is then the explained
sum of squares, or, equivalently, $n - \text{SSR}$ from this regression. If the matrix
$[G(\hat\theta)\;\; M(\hat\theta)]$ has full rank, this test statistic is asymptotically distributed
as $\chi^2\bigl(\frac{1}{2}k(k+1)\bigr)$. If it does not have full rank, as is the case for linear
regression models with a constant term, one or more columns of $M(\hat\theta)$ have
to be dropped, and the number of degrees of freedom for the test reduced
accordingly.
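The rank deficiency just mentioned can be seen in the simplest possible case, a regression on a constant alone with normal errors, so that $\theta = (\mu, \sigma)$. The sketch below is our own illustration with hypothetical simulated data. It builds the three columns of the moment matrix from (15.36) and shows that the column for the pair $(\mu, \mu)$ is proportional to the gradient column for $\sigma$, so that it must be dropped.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
y = 2.0 + 1.5 * rng.normal(size=n)       # hypothetical N(2, 1.5^2) sample

mu = y.mean()
u = y - mu
s = np.sqrt(np.mean(u**2))               # ML estimate of sigma

# Gradient contributions: derivatives of l_t with respect to mu and sigma.
G = np.column_stack([u / s**2, (u**2 - s**2) / s**3])

# Typical elements (15.36): second derivative plus product of first derivatives.
m_mu_mu = -1.0 / s**2 + (u / s**2) ** 2
m_mu_sig = -2.0 * u / s**3 + (u / s**2) * (u**2 - s**2) / s**3
m_sig_sig = 1.0 / s**2 - 3.0 * u**2 / s**4 + ((u**2 - s**2) / s**3) ** 2
M = np.column_stack([m_mu_mu, m_mu_sig, m_sig_sig])

rank_full = np.linalg.matrix_rank(np.column_stack([G, M]))   # 4 rather than 5

# Drop the redundant (mu, mu) column and run the OPG regression.
R = np.column_stack([G, M[:, 1:]])
iota = np.ones(n)
resid = iota - R @ np.linalg.lstsq(R, iota, rcond=None)[0]
im_stat = n - resid @ resid              # to be compared with chi-squared(2)
```

After simplification, the two remaining test regressors are proportional to $u_t^3 - 3\sigma^2 u_t$ and $u_t^4 - 5\sigma^2 u_t^2 + 2\sigma^4$, so in this special case the IM test is a joint test for skewness and excess kurtosis.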


In Exercise 15.11, readers are asked to develop the OPG version of the infor- 
mation matrix test for a particular linear regression model. As the exercise 
shows, the IM test in this case is sensitive to excess kurtosis, skewness, skew- 
ness interacted with the regressors, and any form of heteroskedasticity that 
the test of White (1980) would detect; see Section 7.5. This suggests that we 
might well learn more about what is wrong with a regression model by testing 
for heteroskedasticity, skewness, and kurtosis separately instead of performing 
an information matrix test. We should certainly do that if the IM test rejects 
the null hypothesis.


As we have remarked before, tests based on the OPG regression are extremely
prone to overreject in finite samples. This is particularly true for information 
matrix tests when the number of parameters is not small; see Davidson and 
MacKinnon (1992, 1998). Fortunately, the OPG variant of the IM test is by 
no means the only one that can be used. Davidson and MacKinnon (1998) 
compare the OPG version of the IM test for linear regression models with 
two other versions. One of these is the efficient score, or ES, variant (see 
Section 10.5), and the other is based on the double-length regression, or DLR, 
originally proposed by Davidson and MacKinnon (1984a). They also compare 
the OPG variant of the IM test for probit models with an efficient score 
variant that was proposed by Orme (1988). Although the DLR and both ES 
versions of the IM test are much more reliable than the corresponding OPG 
versions, their finite-sample properties are far from ideal, and they too should 
be bootstrapped whenever the sample size is not extremely large. 


15.3 Nonnested Hypothesis Tests 


Hypothesis testing usually involves nested models, in which the model that 
represents the null hypothesis is a special case of a more general model that 
represents the alternative hypothesis. For such a model, we can always test the 
null hypothesis by testing the restrictions that it imposes on the alternative. 
But economic theory often suggests models that are nonnested. This means 
that neither model can be written as a special case of the other without 
imposing restrictions on both models. In such a case, we cannot simply test 
one of the models against the other, less restricted, one. 


There is an extensive literature on nonnested hypothesis testing. It provides 
a number of ways to test the specification of statistical models when one or 
more nonnested alternatives exist. In this section, we briefly discuss some of
the simplest and most widely-used nonnested hypothesis tests, primarily in 
the context of regression models. 


Testing Nonnested Linear Regression Models 


Suppose we have two competing economic theories which imply different linear
regression models for a dependent variable $y_t$ conditional on some information
set. We can write the two models as

$$\begin{aligned} H_1&:\quad y = X\beta + u_1, \text{ and}\\ H_2&:\quad y = Z\gamma + u_2. \end{aligned} \tag{15.37}$$

Here $y$ is an n-vector with typical element $y_t$, and the regressor matrices $X$
and $Z$, which contain exogenous or predetermined variables, are $n \times k_1$ and
$n \times k_2$, respectively. For simplicity, we will assume that, if the hypothesis $H_i$
holds, then $E(u_i u_i^\top) = \sigma_i^2 I$, for $i = 1, 2$. Thus OLS estimation is appropriate
for whichever model actually generated the data, and we can base inferences
on the usual OLS covariance matrix.


For the models $H_1$ and $H_2$ given in equations (15.37) to be nonnested, it must
be the case that neither of them is a special case of the other. This implies
that $\mathcal{S}(X)$ cannot be a subspace of $\mathcal{S}(Z)$, and vice versa. In other words,
there must be at least one regressor among the columns of $X$ that does not
lie in $\mathcal{S}(Z)$, and there must be at least one regressor among the columns of $Z$
that does not lie in $\mathcal{S}(X)$. We will assume that this is the case.


The simplest and most widely-used nonnested hypothesis tests start from the
artificial comprehensive model

$$y = (1 - \alpha)X\beta + \alpha Z\gamma + u, \tag{15.38}$$

where $\alpha$ is a scalar parameter. When $\alpha = 0$, equation (15.38) reduces to $H_1$,
and when $\alpha = 1$, it reduces to $H_2$. Thus it might seem that, to test $H_1$, we
could simply estimate this model and test whether $\alpha = 0$. However, this is
not possible, because at least one, and usually quite a few, of the parameters
of equation (15.38) cannot be identified. There are $k_1 + k_2 + 1$ parameters in
the regression function of the artificial model, but the number of parameters
that can be identified is the dimension of the subspace $\mathcal{S}(X, Z)$. This cannot
exceed $k_1 + k_2$ and is usually smaller, because some of the regressors, or linear
combinations of them, may appear in both regression functions.


The simplest way to base a test on equation (15.38) is to estimate a restricted
version of it that is identified, namely, the inclusive regression

$$y = X\beta + Z'\gamma' + u, \tag{15.39}$$

where the $n \times k_2'$ matrix $Z'$ consists of the $k_2'$ columns of $Z$ that do not lie in
$\mathcal{S}(X)$. Thus $\mathcal{S}(X, Z) = \mathcal{S}(X, Z')$, and the dimension of this space is $k_1 + k_2'$.
We can estimate the model (15.39) by OLS and test the null hypothesis that
$\gamma' = 0$ by using an ordinary F test with $k_2'$ and $n - k_1 - k_2'$ degrees of freedom.
This provides an easy and reliable way to test $H_1$.
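A minimal sketch of this F test in Python is given below. The function name `inclusive_f_test`, the tolerance used to decide whether a column of $Z$ lies in $\mathcal{S}(X)$, and the simulated data are all assumptions made for the purpose of illustration. The column-by-column screen also assumes that the retained columns of $Z$ are not jointly collinear with $X$.

```python
import numpy as np

def inclusive_f_test(y, X, Z, tol=1e-8):
    """F statistic for gamma' = 0 in the inclusive regression (15.39).

    Returns the statistic and its two degrees-of-freedom parameters.
    """
    n, k1 = X.shape

    def resid(A, B):
        # Residuals from regressing B (a vector or a matrix) on A.
        return B - A @ np.linalg.lstsq(A, B, rcond=None)[0]

    MxZ = resid(X, Z)
    # Keep the columns of Z that are not (numerically) in the span of X.
    keep = np.linalg.norm(MxZ, axis=0) > tol * np.linalg.norm(Z, axis=0)
    Z_prime = Z[:, keep]
    k2p = Z_prime.shape[1]
    rssr = np.sum(resid(X, y) ** 2)                               # restricted SSR
    ussr = np.sum(resid(np.column_stack([X, Z_prime]), y) ** 2)   # unrestricted SSR
    F = ((rssr - ussr) / k2p) / (ussr / (n - k1 - k2p))
    return F, k2p, n - k1 - k2p

# Hypothetical example: both models contain a constant; H1 generated the data.
rng = np.random.default_rng(7)
n = 100
x1, z1, z2 = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), x1])
Z = np.column_stack([np.ones(n), z1, z2])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
F, df1, df2 = inclusive_f_test(y, X, Z)
```

Here the constant is common to both models, so only two columns of $Z$ survive the screen and the test has 2 and 96 degrees of freedom.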


Although the F test for $\gamma' = 0$ in the inclusive regression (15.39) has much
to recommend it, it is not often thought of as a nonnested hypothesis test,
and it does not generalize in a very satisfactory way to the case of nonlinear
regression models. Moreover, it is generally less powerful than the nonnested
hypothesis tests that we are about to discuss when $H_2$ actually generated the
data. We will have more to say about this test below.


Another way to make equation (15.38) identified is to replace the unknown
vector $\gamma$ by a vector of parameter estimates. This idea was first suggested by
Davidson and MacKinnon (1981), who proposed that $\gamma$ be replaced by $\hat\gamma$, the
vector of OLS estimates of the $H_2$ model. Thus, if $\beta$ is redefined appropriately,
equation (15.38) becomes

$$\begin{aligned} y &= X\beta + \alpha Z\hat\gamma + u\\ &= X\beta + \alpha P_Z y + u, \end{aligned} \tag{15.40}$$

where, as usual, $P_Z$ denotes the matrix $Z(Z^\top Z)^{-1}Z^\top$. This leads to the
nonnested hypothesis test that Davidson and MacKinnon called the J test. It
is based on the ordinary t statistic for $\alpha = 0$ in equation (15.40), which they
called the J statistic.¹
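Computing the J statistic requires nothing more than two OLS regressions, as the following sketch suggests. The helper names `ols_t_last` and `j_statistic` and the simulated data are hypothetical, and the covariance matrix used is the usual OLS one.

```python
import numpy as np

def ols_t_last(y, W):
    """t statistic on the last column of W in an OLS regression of y on W."""
    n, k = W.shape
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    resid = y - W @ coef
    s2 = resid @ resid / (n - k)
    cov = s2 * np.linalg.inv(W.T @ W)
    return coef[-1] / np.sqrt(cov[-1, -1])

def j_statistic(y, X, Z):
    """J statistic for testing H1: y = X beta + u against H2: y = Z gamma + u."""
    fitted_h2 = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # P_Z y
    return ols_t_last(y, np.column_stack([X, fitted_h2]))

# Hypothetical data generated by H1.
rng = np.random.default_rng(11)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
J = j_statistic(y, X, Z)
```

To test the other model, one would simply call `j_statistic(y, Z, X)`, with the roles of the two regressor matrices interchanged.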


It is not at all obvious that the J statistic is asymptotically distributed as
$N(0,1)$ under the null hypothesis that the data were generated by $H_1$. After
all, as can be seen from the second equation of (15.40), the test regressor
depends on the regressand. Thus one might expect the regressand to be
positively correlated with the test regressor, even when the null hypothesis is
true. This is generally the case, but only in finite samples. The proof that
the J statistic is asymptotically valid depends on the fact that, under the null
hypothesis, the numerator of the test statistic is

$$y^\top M_X P_Z y = u^\top M_X P_Z X\beta_0 + u^\top M_X P_Z u, \tag{15.41}$$

where $\beta_0$ is the true parameter vector. The left-hand side of this equation
can easily be obtained by applying the FWL Theorem to the second line of
equation (15.40). The right-hand side follows when we replace $y$ by $X\beta_0 + u$.
There are only two terms on the right-hand side of the equation, because
$\beta_0^\top X^\top M_X = 0$.

The first term on the right-hand side of equation (15.41) is a weighted average
of the elements of the vector $u$. Under standard regularity conditions, we may
apply a central limit theorem to it, with the result that this term is $O_p(n^{1/2})$.
In contrast, the second term is $O_p(1)$, as can be seen from the following:

$$\begin{aligned} u^\top M_X P_Z u &= u^\top P_Z u - u^\top P_X P_Z u\\ &= n^{-1/2}u^\top Z\,(n^{-1}Z^\top Z)^{-1}\,n^{-1/2}Z^\top u\\ &\quad - n^{-1/2}u^\top X\,(n^{-1}X^\top X)^{-1}\,n^{-1}X^\top Z\,(n^{-1}Z^\top Z)^{-1}\,n^{-1/2}Z^\top u. \end{aligned}$$

Since the error terms from the $H_1$ model are uncorrelated with the regressors
of the $H_2$ model when the former is true, we can apply a central limit theorem
to both $n^{-1/2}X^\top u$ and $n^{-1/2}Z^\top u$, so that these expressions are both $O_p(1)$.
So too, under standard regularity conditions, are the cross-product matrices of
the form $n^{-1}W^\top W$, where $W$ stands for either $X$ or $Z$. It follows that $n^{-1/2}$
times the numerator of the J statistic has the same asymptotic distribution
as $n^{-1/2}$ times the first term in (15.41). This distribution is

$$N\bigl(0,\; n^{-1}\sigma_1^2\,\beta_0^\top X^\top P_Z M_X P_Z X\beta_0\bigr). \tag{15.42}$$

It can be shown that $n^{-1}$ times the square of the denominator of the test
statistic consistently estimates the variance that appears in expression (15.42);
see Exercise 15.12. The J statistic itself is therefore asymptotically distributed
as $N(0,1)$ under the null hypothesis.


¹ This J statistic should not be confused with the Hansen-Sargan statistic
discussed in Section 9.4, which some authors refer to as the J statistic.


Although the J test is asymptotically valid, it generally is not exact in finite 
samples, although there is an exception in one very special case, which is 
treated in Exercise 15.13. In fact, because the second term on the right-hand 
side of equation (15.41) usually has a positive expectation under the null, 
the numerator of the J statistic generally has a positive mean, and so does 
the test statistic itself. In consequence, the J test tends to overreject, often 
quite severely, in finite samples. Theoretical results in Davidson and MacKin- 
non (2002a), which are consistent with the results of simulation experiments 
reported in a number of papers, suggest that the overrejection tends to be 
particularly severe when at least one of the following conditions holds: 


• The sample size is small;
• The model under test does not fit very well;
• The number of regressors in $H_2$ that do not appear in $H_1$ is large.
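To see why the second term on the right-hand side of (15.41) has a positive expectation, a short calculation may help. It is only a sketch, and it assumes, beyond what is stated explicitly above, that $X$ and $Z$ can be treated as fixed (or conditioned upon), that $Z$ has full column rank $k_2$, and that $E(uu^\top) = \sigma_1^2 I$ under $H_1$.

```latex
\begin{align*}
E\bigl(u^\top M_X P_Z u\bigr)
  &= E\bigl(\operatorname{tr}(M_X P_Z\, u u^\top)\bigr)
   = \sigma_1^2 \operatorname{tr}(M_X P_Z) \\
  &= \sigma_1^2 \operatorname{tr}(P_Z M_X P_Z)
   = \sigma_1^2 \bigl(k_2 - \operatorname{tr}(P_X P_Z)\bigr) \;\ge\; 0 .
\end{align*}
```

The trace of the positive semidefinite matrix $P_Z M_X P_Z$ can be zero only if $M_X P_Z = 0$, that is, only if $\mathcal{S}(Z)$ is a subspace of $\mathcal{S}(X)$, which is ruled out when the models are nonnested. The trace tends to be larger when more directions in $\mathcal{S}(Z)$ lie outside $\mathcal{S}(X)$, which is consistent with the third condition in the list above.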


Bootstrapping the J test dramatically improves its finite-sample performance. 
The bootstrap data may be generated under $H_1$ using either a fully parametric
or a semiparametric bootstrap DGP, as discussed in Section 4.6. If the latter is 
used, it is very important to rescale the residuals before they are resampled. In 
most cases, the bootstrap J test is quite reliable, even in very small samples; 
see Godfrey (1998) and Davidson and MacKinnon (2002a). An even more 
reliable test may be obtained by using a more sophisticated bootstrapping 
procedure proposed by Davidson and MacKinnon (2002b). 
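A semiparametric bootstrap J test along these lines might be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any of the papers cited: the regressors are treated as fixed, the residuals are rescaled by the factor $(n/(n-k_1))^{1/2}$ before resampling, the P value is two-tailed, and all function names are our own.

```python
import numpy as np

def ols(y, W):
    """Return OLS coefficients and residuals from regressing y on W."""
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    return coef, y - W @ coef

def j_statistic(y, X, Z):
    """t statistic on P_Z y when it is added to the regression of y on X."""
    W = np.column_stack([X, Z @ ols(y, Z)[0]])
    coef, resid = ols(y, W)
    s2 = resid @ resid / (W.shape[0] - W.shape[1])
    return coef[-1] / np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])

def bootstrap_j_test(y, X, Z, B=399, seed=0):
    """Semiparametric bootstrap P value for the J test of H1 (a sketch)."""
    rng = np.random.default_rng(seed)
    n, k1 = X.shape
    beta_hat, u_hat = ols(y, X)
    u_rescaled = u_hat * np.sqrt(n / (n - k1))    # rescale before resampling
    j_hat = j_statistic(y, X, Z)
    j_star = np.empty(B)
    for b in range(B):
        # The bootstrap DGP satisfies H1 by construction.
        y_star = X @ beta_hat + rng.choice(u_rescaled, size=n, replace=True)
        j_star[b] = j_statistic(y_star, X, Z)
    # Two-tailed P value based on |J|; a one-tailed version is also possible.
    return j_hat, float(np.mean(np.abs(j_star) >= abs(j_hat)))

# Hypothetical data generated by H1, with three extra regressors in H2.
rng = np.random.default_rng(21)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
j_hat, p_boot = bootstrap_j_test(y, X, Z)
```

If either model contained lagged dependent variables, the bootstrap samples would have to be generated recursively instead, which this sketch does not attempt.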


Another way to obtain a nonnested test that is more reliable than the asymptotic
J test in finite samples is to replace $\hat\gamma$ in the first line of equation (15.40)
by another estimate of $\gamma$, namely,

$$\tilde\gamma \equiv (Z^\top Z)^{-1}Z^\top P_X y. \tag{15.43}$$

This estimate may be obtained by regressing $P_X y$ on $Z$. It is an estimate of
the expectation of $\hat\gamma$ when $H_1$ actually generates the data. The test regression
is then

$$\begin{aligned} y &= X\beta + \alpha Z\tilde\gamma + u\\ &= X\beta + \alpha P_Z P_X y + u, \end{aligned} \tag{15.44}$$

and the test statistic is, once again, the t statistic for $\alpha = 0$. This test
statistic, which was originally proposed by Fisher and McAleer (1981), is
called the $J_A$ statistic. The resulting $J_A$ test has much better finite-sample
properties under the null hypothesis than the ordinary J test. In fact, the test
is exact whenever both the $H_1$ and $H_2$ models satisfy all the assumptions of
the classical normal linear model, for exactly the same reason that the RESET
test is exact in a similar situation; see Godfrey (1983) and Exercise 15.3.


Unfortunately, the excellent performance of the $J_A$ test under the null is not
accompanied by equally good performance under the alternative. As can be
seen from the second of equations (15.44), the vector $y$ is projected onto $X$
before $\gamma$ is estimated. In consequence, $\tilde\gamma$ may differ greatly from $\hat\gamma$ when $H_1$ is
false, and evidence that the $H_1$ model is incorrect may therefore be suppressed.
Simulation experiments have shown that the $J_A$ test can be very much less
powerful than the J test; see, for example, Davidson and MacKinnon (1982).
A rejection by the $J_A$ test should be taken very seriously, but a failure to
reject provides little information. In contrast, the J test, when bootstrapped,
appears to be both reliable and powerful in samples of reasonable size.
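The only computational difference between the two tests is the test regressor, as the following sketch makes explicit. The helper names `proj`, `t_last`, and `j_and_ja` and the simulated data are hypothetical.

```python
import numpy as np

def proj(A, v):
    """Orthogonal projection of v onto the column space of A."""
    return A @ np.linalg.lstsq(A, v, rcond=None)[0]

def t_last(y, W):
    """t statistic on the last column of W in an OLS regression of y on W."""
    n, k = W.shape
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    resid = y - W @ coef
    s2 = resid @ resid / (n - k)
    return coef[-1] / np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])

def j_and_ja(y, X, Z):
    """Return (J, J_A): the test regressors are P_Z y and P_Z P_X y."""
    j = t_last(y, np.column_stack([X, proj(Z, y)]))
    ja = t_last(y, np.column_stack([X, proj(Z, proj(X, y))]))
    return j, ja

# Hypothetical data generated by H1.
rng = np.random.default_rng(13)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
J, JA = j_and_ja(y, X, Z)
```

Running both statistics on data generated by the alternative model instead is one simple way to see, in a small simulation, the loss of power discussed above.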


The J and $J_A$ tests are by no means the only nonnested tests that have
been proposed for linear regression models. In particular, several tests have 
been based on the pioneering work of Cox (1961, 1962), which we will discuss 
further below. The most notable of these were proposed by Pesaran (1974) and 
Godfrey and Pesaran (1983). However, since these tests are asymptotically 
equivalent to the J test, have finite-sample properties that are either dreadful 
(for the first test) or mediocre (for the second one), and are more complicated 
to compute than the J test, especially in the case of the second one, there 
appears to be no reason to employ them in practice. 


Testing Nonnested Nonlinear Regression Models 


The J test can readily be extended to nonlinear regression models. Suppose
the two models are

$$\begin{aligned} H_1&:\quad y = x(\beta) + u_1, \text{ and}\\ H_2&:\quad y = z(\gamma) + u_2. \end{aligned} \tag{15.45}$$

When we say that these two models are nonnested, we mean that there are
values of $\beta$, usually infinitely many of them, for which there is no admissible $\gamma$
for which $x(\beta) = z(\gamma)$, and, similarly, values of $\gamma$ for which there is no
admissible $\beta$ such that $z(\gamma) = x(\beta)$. In other words, neither model is a
special case of the other unless we impose restrictions on both models. The
artificial comprehensive model analogous to equation (15.38) is

$$y = (1 - \alpha)x(\beta) + \alpha z(\gamma) + u,$$

and the J statistic is the t statistic for $\alpha = 0$ in the nonlinear regression

$$y = (1 - \alpha)x(\beta) + \alpha\hat z + \text{residuals}, \tag{15.46}$$

where $\hat z \equiv z(\hat\gamma)$, $\hat\gamma$ being the vector of NLS estimates of the regression
model $H_2$. It can be shown that, under suitable regularity conditions, this
test statistic is asymptotically distributed as $N(0,1)$ under $H_1$; see Davidson
and MacKinnon (1981).


Copyright © 1999, Russell Davidson and James G. MacKinnon


Because some of the parameters of the nonlinear regression (15.46) may not be
well identified, the J statistic can be difficult to compute. This difficulty can
be avoided in the usual way, that is, by running the GNR which corresponds
to equation (15.46), evaluated at $\alpha = 0$ and $\beta = \hat\beta$. This GNR is

$$y - \hat x = \hat X b + a(\hat z - \hat x) + \text{residuals}, \tag{15.47}$$

where $\hat x \equiv x(\hat\beta)$, and $\hat X \equiv X(\hat\beta)$ is the matrix of derivatives of $x(\beta)$ with
respect to $\beta$, evaluated at the NLS estimates $\hat\beta$. The ordinary t statistic for
$a = 0$ in regression (15.47) is called the P statistic. Under the null hypothesis,
it is asymptotically equal to the corresponding J statistic. The P test is much
more commonly used than the J test when the $H_1$ model is nonlinear.
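Once the two models have been estimated, the P statistic is just a t statistic from one linear regression. The sketch below assumes that the fitted values and the matrix of derivatives have already been computed by some NLS routine, which is not shown. The function name `p_statistic` and its arguments are hypothetical, and the example uses a linear $H_1$ only because the result is then easy to check.

```python
import numpy as np

def p_statistic(y, x_hat, X_hat, z_hat):
    """t statistic for a = 0 in the GNR (15.47).

    x_hat : fitted values x(beta_hat) from NLS estimation of H1
    X_hat : n x k1 matrix of derivatives of x(beta), evaluated at beta_hat
    z_hat : fitted values z(gamma_hat) from NLS estimation of H2
    """
    W = np.column_stack([X_hat, z_hat - x_hat])
    r = y - x_hat                                   # regressand of the GNR
    n, k = W.shape
    coef = np.linalg.lstsq(W, r, rcond=None)[0]
    resid = r - W @ coef
    s2 = resid @ resid / (n - k)
    return coef[-1] / np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])

# Linear special case with hypothetical data: x(beta) = X beta, so X_hat = X.
rng = np.random.default_rng(17)
n = 90
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
x_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
z_hat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
P = p_statistic(y, x_hat, X, z_hat)
```

When $H_1$ is linear, the FWL Theorem implies that the P statistic computed in this way coincides numerically with the J statistic from regression (15.40). For a nonlinear $H_1$, the two are equal only asymptotically.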


Numerous other nonnested tests are available for nonlinear regression models.
These include the $P_A$ test, which is related to the P test in precisely the
same way as the $J_A$ test is related to the J test in the case of linear models.
Because $H_1$ is nonlinear, the $P_A$ test may not be particularly reliable in finite
samples, and, like the $J_A$ test, it can suffer from a serious lack of power. In
contrast, a bootstrap version of the P test should be reasonably reliable and
quite powerful. We therefore recommend using it rather than the $P_A$ test if
computer time is not a constraint.


The J and P tests can both be made robust to heteroskedasticity of unknown 
form either by using heteroskedasticity-robust standard errors (Section 5.5) or 
by using the HRGNR (Section 6.8). Like ordinary J and P tests, these tests 
should be bootstrapped. However, bootstrapping heteroskedasticity-robust 
tests requires procedures different from those used to bootstrap ordinary t 
and F tests, because the bootstrap DGP has to preserve the relationship 
between the regressors and the variances of the error terms. This means that 
we cannot use IID errors or resampled residuals. For introductory discussions 
of bootstrap methods for regression models with heteroskedastic errors, see 
Horowitz (2001) and MacKinnon (2002). 


It is straightforward to extend the J and P tests to handle more than two
nonnested alternatives. For concreteness, suppose there are three competing
models. Then a J test of $H_1$ could be based on an F statistic for the joint
significance of the fitted values from $H_2$ and $H_3$ when they are added to the
regression for $H_1$. Similarly, a P test of $H_1$ could be based on an F statistic for
the joint significance of the difference between the fitted values from $H_2$ and
$H_1$, and the difference between the fitted values from $H_3$ and $H_1$, when they
are both added to the GNR for $H_1$ evaluated at the least squares estimates
of that model.


The P test can also be extended to linear and nonlinear multivariate regression
models; see Davidson and MacKinnon (1983). One starts by formulating an
artificial comprehensive model analogous to (15.38), with just one additional
parameter, replaces the parameters of the $H_2$ model by suitable estimates,
and then obtains a P test based on the multivariate GNR (12.53) for the
model under test. Because there is more than one plausible way to specify
the artificial comprehensive model, more than one such test can be computed.


Interpreting Nonnested Tests


All of the nonnested hypothesis tests that we have discussed are really just
specification tests of the $H_1$ model from either equations (15.37) or (15.45).
If we reject the null hypothesis, there is no implication that the $H_2$ model is
true. To say anything about the validity of the $H_2$ model, we need to test it.
This can be done by interchanging the roles of the two models. For example,
the J test of $H_2$ in the linear case would be based on the regression

$$\begin{aligned} y &= Z\gamma + \alpha' X\hat\beta + u\\ &= Z\gamma + \alpha' P_X y + u, \end{aligned} \tag{15.48}$$

where $\alpha' \equiv 1 - \alpha$. The J statistic would then be the ordinary t statistic for
$\alpha' = 0$ in regression (15.48).


When we perform a pair of nonnested tests, testing each of $H_1$ and $H_2$ against
the other, there are four possible outcomes:


• Reject $H_1$ but do not reject $H_2$;
• Reject $H_2$ but do not reject $H_1$;
• Reject both models;
• Do not reject either model.


Since the first two outcomes lead us to prefer one of the models, it is tempting 
to see them as natural and desirable. However, the last two outcomes, which 
are by no means uncommon in practice, can also be very informative. If both 
models are rejected, then we need to find some other model that fits better. 
If neither model is rejected, then we have learned that the data appear to be 
compatible with both hypotheses. 


Because nonnested hypothesis tests are designed as specification tests, rather 
than as procedures for choosing among competing models, it is not at all 
surprising that they sometimes do not lead us to choose one model over the 
other. If we simply want to choose the “best” model out of some set of 
competing models, whether or not any of them is satisfactory, then we should 
use a completely different approach, based on what are called information 
criteria. This approach will be discussed in the next section. 


Encompassing Tests 


If the true DGP belongs to model $H_1$, then it should be possible to derive the
properties of parameter estimates from model $H_2$ in terms of the properties of
model $H_1$. This is the idea behind what are called encompassing tests. It is
very similar to the idea behind indirect inference, a topic we briefly discussed
in Section 13.3. Binding functions, as defined in the context of indirect inference,
specify the plim of the parameter estimates from model $H_2$ in terms of
the parameters of the true DGP, which is assumed to be in $H_1$. Thus a test
of $H_1$ can be based on a comparison of the actual $H_2$ parameter estimates and
estimates of the values of the binding functions under the assumption that
$H_1$ generated the data.


As a concrete example, consider the linear case in which the two models 
are given by equations (15.37). If the DGP is a special case of $H_1$ with 
parameters $\beta_0$, the binding functions evaluated at $\beta_0$ give the plim of the 
vector $\hat\gamma$ obtained by estimating $H_2$. Since the columns of $Z$ are assumed to 
be exogenous or predetermined, we see that 


$$\operatorname*{plim}_{n\to\infty} \hat\gamma = \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top Z\Bigr)^{-1} \Bigl(\operatorname*{plim}_{n\to\infty} \frac{1}{n} Z^\top X\beta_0\Bigr).$$


We can estimate this probability limit by dropping the plims on the right-hand 
side and replacing $\beta_0$ by $\hat\beta$. Doing so yields the estimator 
$\tilde\gamma = (Z^\top Z)^{-1}Z^\top P_X y$ defined in equation (15.43). An encompassing 
test can therefore be based on the vector of contrasts between $\hat\gamma$ and 
$\tilde\gamma$. This vector is 


$$(Z^\top Z)^{-1}Z^\top y - (Z^\top Z)^{-1}Z^\top P_X y = (Z^\top Z)^{-1}Z^\top M_X y. \qquad (15.49)$$


The leading factor $(Z^\top Z)^{-1}$ has no effect on the test, because it is just a 
square matrix of full rank. Since some columns of $Z$ generally lie in $\mathcal{S}(X)$, 
some of the rows of the matrix $Z^\top M_X$ usually are identically zero. Thus, 
as before, we let $Z'$ denote the remaining columns of $Z$. Then what we really 
want to test is whether the plim of the vector $n^{-1} Z'^\top M_X y$ is zero. This calls 
for a conditional moment test. Since the model $H_1$ is linear, such a test can 
be implemented without an explicit GNR simply by using the columns of $Z'$ 
as test regressors, that is, by using the inclusive regression (15.39) as a test 
regression. The test statistic is just the F statistic for $\gamma' = 0$ in (15.39), which 
we have already discussed. 


The parallels between this sort of encompassing test and the DWH test dis- 
cussed in Section 8.6 are illuminating. Both tests can be implemented as 
F tests—in the case of the DWH test, an F test based on regression (8.77). 
In both cases, the F test almost always has fewer degrees of freedom in the 
numerator than the number of parameters. The interested reader may find it 
worthwhile to show explicitly that a DWH test can be set up as a conditional 
moment test. 


For a detailed discussion of the concept of encompassing and various tests 
that are based on it, see Hendry (1995, Chapter 14). Encompassing tests are 
available for a variety of nonlinear models; see Mizon and Richard (1986). 
However, there can be practical difficulties with these tests. These difficulties 
are similar to the ones that can arise with Hausman tests which are based 
directly on a vector of contrasts; see Section 8.6. The basic problem is that it 
can be difficult to ascertain the dimension of the space analogous to $\mathcal{S}(X, Z)$, 
and, in consequence, it can be difficult to determine the appropriate number 
of degrees of freedom for the test. 


Cox Tests 


Nonnested hypothesis tests are available for a large number of models that are 
not regression models. Most of these tests are based on one of two approaches. 
The first approach, which previously led to the J and P tests, involves forming 
an artificial comprehensive model and then replacing the parameters of the 
$H_2$ model by estimates that are asymptotically nonstochastic. As an example 
of this approach, Exercise 15.19 asks readers to derive a test similar to the 
P test for binary response models. The second approach, which we briefly 
discuss in this subsection, is based on two classic papers by Cox (1961, 1962). 
It leads to what are generally called Cox tests. 


Suppose the two nonnested models are each to be estimated by maximum 
likelihood, and that their loglikelihood functions are 


$$\ell_1(\theta_1) = \sum_{t=1}^{n} \ell_{1t}(\theta_1) \quad\text{and}\quad \ell_2(\theta_2) = \sum_{t=1}^{n} \ell_{2t}(\theta_2), \qquad (15.50)$$


for models $H_1$ and $H_2$, respectively. The notation, which is similar to that 
used in Chapter 10, omits the dependence on the data for clarity. Cox’s 
original idea was to extend the idea of a likelihood ratio test, and so he 
considered what would be the LR statistic if $H_1$ were nested in $H_2$, namely, 
$2(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1))$, where $\hat\theta_1$ and $\hat\theta_2$ are the ML estimates of the two models. 


The statistical properties of the LR statistic are quite different when $H_1$ 
and $H_2$ are nonnested rather than nested. In particular, it is necessary to 
divide the statistic by $n^{1/2}$ in order to obtain a random variable with a 
well-defined asymptotic distribution. It is then convenient to center this variable 
by subtracting its expectation. Since, according to equations (15.50), both 
$\ell_1(\theta_1)$ and $\ell_2(\theta_2)$ are sums of contributions, it is reasonable to suppose that 
the expression 


$$2n^{-1/2}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr) - 2n^{-1/2} E_{\theta_1}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr) \qquad (15.51)$$


is asymptotically normal, where the notation $E_{\theta_1}$ denotes an expectation 
taken under the DGP in the $H_1$ model with parameter vector $\theta_1$. 


Since the parameter vector $\theta_1$ is not known, the expectation in (15.51) cannot 
be calculated. It is natural to estimate it by replacing the true $\theta_1$ by the ML 
estimate $\hat\theta_1$, but then we face the problem of parameter uncertainty if we wish 
to estimate the variance of the result. Cox solved this problem by showing 
that the statistic 


$$T_1 = 2n^{-1/2}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr) - 2n^{-1/2} E_{\hat\theta_1}\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr) \qquad (15.52)$$


is indeed asymptotically normally distributed, with mean 0 and a variance 
that can be estimated consistently using a formula given in his 1962 paper. 


It turns out that the statistic (15.52) is unnecessarily complicated. As readers 
are invited to check in Exercise 15.20, 


$$n^{-1/2}\bigl(\ell_1(\hat\theta_1) - E_{\hat\theta_1}(\ell_1(\hat\theta_1))\bigr) = O_p(n^{-1/2})$$


under the usual regularity conditions for a Type 2 MLE. Thus the $T_1$ statistic 
is asymptotically equivalent to the simpler statistic 


$$T_1' = 2n^{-1/2}\bigl(\ell_2(\hat\theta_2) - E_{\hat\theta_1}(\ell_2(\hat\theta_2))\bigr). \qquad (15.53)$$


It can be seen from (15.53) that the Cox test is in fact an encompassing test, 
in which the maximized loglikelihood function for model $H_2$ is compared with 
its expectation under the DGP in model $H_1$ with parameter vector $\hat\theta_1$. The 
expectation $E_{\theta_1}(\ell_2(\hat\theta_2))$ can be interpreted naturally as the binding function 
for $\ell_2(\hat\theta_2)$. 


The Cox test can also be interpreted as a conditional moment test. The 
moment condition can be written as 


$$\operatorname{plim}_{\theta_1}\, \frac{1}{n}\sum_{t=1}^{n}\bigl(\ell_{2t}(\hat\theta_2) - E_{\theta_1}(\ell_{2t}(\hat\theta_2))\bigr) = 0,$$


and the empirical moment as 


$$\frac{1}{n}\sum_{t=1}^{n}\bigl(\ell_{2t}(\hat\theta_2) - E_{\hat\theta_1}(\ell_{2t}(\hat\theta_2))\bigr).$$


These expressions make it clear that the contributions $\ell_{2t}(\hat\theta_2)$ are treated as 
functions of the data alone. Thus the moment depends on the parameters $\theta_1$ 
of $H_1$, the model under test, only through the expectation. 


The conditional moment interpretation leads naturally to an implementation 
of the Cox test by artificial regression. The easiest one to set up is, as 
usual, the OPG regression. Since there is only one test regressor, it takes 
the form (15.24). The matrix $G$ in (15.24) is the matrix of contributions to 
the gradient for the $H_1$ model, evaluated at $\hat\theta_1$. The test regressor can be 
expressed in two different ways, which lead to asymptotically, but not 
numerically, equivalent statistics. The typical element can be either 


$$\ell_2(\hat\theta_2) - E_{\hat\theta_1}(\ell_2(\hat\theta_2)) \quad\text{or}\quad \ell_{2t}(\hat\theta_2) - E_{\hat\theta_1}(\ell_{2t}(\hat\theta_2)). \qquad (15.54)$$


It can be shown easily enough that both choices satisfy conditions R1–R3. 
The first choice may be easier to compute, since there is only one expectation, 
whereas the second choice requires the computation of $n$ expectations. 


For regression models, there is a close relationship between Cox tests and tests 
based on artificial comprehensive models. Cox tests for linear and nonlinear 
regression models were derived by Pesaran (1974) and Pesaran and Deaton 
(1978), respectively. These tests were shown to be asymptotically equivalent 
to the corresponding J and P tests by Davidson and MacKinnon (1981).² 


The only difficulty involved in calculating a Cox test, in general, is obtaining 
the expectation of $\ell_2(\hat\theta_2)$ under $H_1$. Since the test is valid only asymptotically, 
it is legitimate to replace the expectation in the first expression in (15.54) by 
a probability limit, which may be simpler to evaluate analytically than the 
expectation. In cases in which no analytic expression is available, we may 
evaluate any of the expectations in (15.54) by simulation. After estimating 
the $H_1$ model, we generate $S$ sets of simulated data from the DGP with 
parameter vector $\hat\theta_1$, estimate $H_2$ using the simulated data, and then estimate 
the expectation of $\ell_{2t}(\hat\theta_2)$ as 


$$\frac{1}{S}\sum_{s=1}^{S} \ell_{2t}(\theta^*_{2s}), \qquad (15.55)$$


where $\theta^*_{2s}$ denotes the estimate of $\theta_2$ based on the $s$th set of simulated data, 
and each contribution is evaluated using that same set of simulated data. 
The expectation of $\ell_2(\hat\theta_2)$ is obtained by summing expression (15.55) over $t$. 
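

The simulation step can be sketched in code for a deliberately simple pair of 
nonnested models. Everything specific here is an assumption made for 
illustration and is not taken from the text: $H_1$ is an exponential distribution, 
$H_2$ is a lognormal distribution (both have closed-form ML estimates), and the 
function names, the seed, and the values of n and S are hypothetical. The 
sketch averages the whole loglikelihood over the S simulated samples, which 
is the same as summing the per-observation averages over $t$. 

```python
import math
import random

def fit_exponential(y):
    """ML estimate of the mean of an exponential distribution (the H1 model)."""
    return sum(y) / len(y)

def fit_lognormal(y):
    """ML estimates (mu, sigma2) of a lognormal distribution (the H2 model)."""
    logs = [math.log(v) for v in y]
    mu = sum(logs) / len(logs)
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, sigma2

def loglik_lognormal(y, mu, sigma2):
    """Loglikelihood of H2: the sum over t of the contributions l_2t."""
    return sum(
        -math.log(v) - 0.5 * math.log(2 * math.pi * sigma2)
        - (math.log(v) - mu) ** 2 / (2 * sigma2)
        for v in y
    )

def simulated_expectation(theta1_hat, n, S, rng):
    """Estimate E(l_2(theta2_hat)) under H1 at theta1_hat from S simulated samples."""
    total = 0.0
    for _ in range(S):
        y_star = [rng.expovariate(1.0 / theta1_hat) for _ in range(n)]
        mu_star, s2_star = fit_lognormal(y_star)  # re-estimate H2 on simulated data
        total += loglik_lognormal(y_star, mu_star, s2_star)
    return total / S

rng = random.Random(12345)
y = [rng.expovariate(0.5) for _ in range(200)]  # artificial data with mean 2
theta1_hat = fit_exponential(y)
mu_hat, s2_hat = fit_lognormal(y)
l2_hat = loglik_lognormal(y, mu_hat, s2_hat)
e_l2 = simulated_expectation(theta1_hat, len(y), S=200, rng=rng)

# Numerator of the simplified Cox statistic (15.53); no variance estimate yet
cox_numerator = 2 * len(y) ** -0.5 * (l2_hat - e_l2)
```

The quantity `cox_numerator` is not yet a test statistic, since it still has to 
be studentized or bootstrapped. 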


As we have remarked before, the OPG regression does not have very good 
finite-sample properties. This suggests that it is generally wise to bootstrap 
any test based on it. When the expectation of $\ell_2(\hat\theta_2)$ can be calculated without 
simulation, this often poses no serious difficulty. However, if we have to use 
simulation, bootstrapping involves estimating the $H_2$ model $S + 1$ times for 
each of $B$ bootstrap samples, and this may be computationally demanding. 


Our discussion of nonnested hypothesis testing has necessarily omitted many 
topics. Survey articles on this subject include Gouriéroux and Monfort (1994), 
McAleer (1995), and Pesaran and Weeks (2001). In general, nonnested tests 
based on asymptotic theory have poor finite-sample properties. It is therefore 
desirable to bootstrap them in many, if not most, cases. However, except 
for tests of linear regression models (Davidson and MacKinnon, 2002a), not 
much is known about the finite-sample properties of bootstrapped nonnested 
hypothesis tests. 


² The negative of the Cox statistic, as formulated in these papers, is 
asymptotically equal to the corresponding J or P test. 


15.4 Model Selection Based on Information Criteria 


As we remarked in the previous section, testing each of two nonnested models 
against the other may or may not allow us to choose one model over the other. 
More generally, if we have $m$ models and perform $m(m - 1)$ pairwise tests, 
we cannot reasonably expect to find that one and only one of the models is 
never rejected. Thus, if our objective is to choose the best model out of the 
$m$ competing models, and we do not care whether even the best model is 
false, we should not use nonnested hypothesis tests. Instead, we should use a 
procedure explicitly designed for model selection. Such a procedure generally 
involves calculating some sort of criterion function for each of the models and 
picking the model for which that function is maximized or minimized. 


For concreteness, suppose that, for the same dependent variable or variables, 
we have $m$ competing models that are estimated by maximum likelihood, 
ordinary least squares, or nonlinear least squares. Let $\theta_i$ be the $k_i$-vector of 
parameters for the $i$th model, and let $\ell_i(\hat\theta_i)$ denote the maximized value of the 
loglikelihood function for that model, which we may take to be $-\frac{n}{2}\log \mathrm{SSR}$ 
in the case of models estimated by least squares. It might seem natural to 
pick the model with the largest value of $\ell_i(\hat\theta_i)$. However, if the models are 
nested, this simply leads us to pick the model with the greatest number of 
parameters, even when other models fit almost as well. This violates the 
principle that, when each one of a set of nested models is correctly specified, 
we should prefer the one that has fewest parameters to estimate. This model 
is called the most parsimonious model of the set. With nonnested models, 
it is not necessarily the case that the least parsimonious of them yields the 
greatest value of the loglikelihood function, but, whenever $k_i > k_j$, model $i$ 
plainly has an advantage over model $j$ and therefore tends to be chosen too 
often when parsimony is a concern. 


To avoid this problem, we evidently need to penalize models with a large 
number of parameters. This idea leads to various criterion functions that can 
be used to rank competing models. The most widely used of these is probably 
the Akaike information criterion, or AIC (Akaike, 1973). There is more than 
one version of the AIC. For model $i$, the simplest is 


$$\mathrm{AIC}_i = \ell_i(\hat\theta_i) - k_i. \qquad (15.56)$$


Thus we reduce the loglikelihood function of each model by 1 for every 
estimated parameter, and we then choose the model that maximizes $\mathrm{AIC}_i$. The 
original form of the AIC is equivalent to (15.56) but a bit more complicated, 
and it is supposed to be minimized instead of maximized. Users of black-box 
software packages should make sure that they understand precisely what is 
being printed if a package prints what it calls the AIC. 


The AIC does not always respect the need for parsimony any more than 
the maximized loglikelihood function. Consider two nested models, $H_1$ and 
$H_2$, with $k$ and $k + 1$ parameters, respectively. Asymptotically, twice the 
difference between the two loglikelihood functions is distributed as $\chi^2(1)$ if 
$H_1$ is correctly specified. Therefore, the probability that $\mathrm{AIC}_2$ is greater than 
$\mathrm{AIC}_1$ tends in large samples to the probability mass in the right-hand tail of 
the $\chi^2(1)$ distribution beyond 2, which is 0.1573. Thus, even with an infinitely 
large sample, we choose the less parsimonious model nearly 16% of the time. 
This example illustrates a general problem. Whenever two or more models 
are nested, the AIC may fail to choose the most parsimonious of those that 
are correctly specified. If all the models are nonnested, and only one is well 
specified, the AIC chooses that one asymptotically, but so does simply picking 
the model with the largest value of the loglikelihood function. 
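

The tail probability quoted above is easy to verify. The short sketch below 
uses the fact that a $\chi^2(1)$ variable is the square of a standard normal one, 
so that its upper tail can be written in terms of the complementary error 
function; the function name is our own choice. 

```python
import math

def chi2_1_upper_tail(c):
    """P(X > c) for X ~ chi-squared(1), using X = Z**2 with Z standard normal."""
    return math.erfc(math.sqrt(c / 2.0))

# AIC_2 > AIC_1 requires 2*(l_2 - l_1) > 2 when H2 has one extra parameter
p_overfit = chi2_1_upper_tail(2.0)
print(round(p_overfit, 4))  # 0.1573
```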


A popular alternative to the AIC, which avoids the problem discussed in the 
preceding paragraph, is the Bayesian information criterion, or BIC, which 
was proposed by Schwarz (1978). For model $i$, the BIC is 


$$\mathrm{BIC}_i = \ell_i(\hat\theta_i) - \tfrac{1}{2} k_i \log n. \qquad (15.57)$$


The factor of $\log n$ in the penalty term ensures that, as $n \to \infty$, the penalty 
for having an additional parameter becomes very large. Thus, asymptotically, 
there is no danger of choosing an insufficiently parsimonious model. If we 
compare a false but parsimonious model $H_2$ with a correctly specified model $H_1$ 
that may have more parameters, the BIC chooses $H_1$ asymptotically, since, 
as readers are asked to check in Exercise 15.24, the difference $\mathrm{BIC}_1 - \mathrm{BIC}_2$ 
tends to infinity with the sample size. 
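

As a small illustration of how the two criteria can disagree, the following 
sketch computes the AIC in the form (15.56) and the BIC in the form (15.57) 
and picks the model with the largest value. The model names and the 
loglikelihood values are made up, and the helper `select` is hypothetical. 

```python
import math

def aic(loglik, k):
    """AIC in the form (15.56): maximized loglikelihood minus number of parameters."""
    return loglik - k

def bic(loglik, k, n):
    """BIC in the form (15.57): penalty of (1/2) log n per parameter."""
    return loglik - 0.5 * k * math.log(n)

def select(models, n, criterion="bic"):
    """Return the name of the model with the largest criterion value.

    `models` maps a model name to (maximized loglikelihood, number of parameters).
    """
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n)
    return max(models.items(), key=score)[0]

# Made-up numbers: the larger model fits only slightly better
models = {"small": (-520.0, 3), "large": (-518.8, 4)}
print(select(models, n=400, criterion="aic"))  # large
print(select(models, n=400, criterion="bic"))  # small
```

With these numbers the gain in loglikelihood from the extra parameter is 1.2, 
which exceeds the AIC penalty of 1 but not the BIC penalty of about 3. 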


It is possible to extend the Akaike and Bayesian information criteria to models 
that are not estimated by maximum likelihood or least squares. See Andrews 
and Lu (2001) for a detailed discussion in the context of GMM estimation. 
The penalty terms depend on the number of overidentifying restrictions rather 
than on the number of parameters only. These penalty terms are twice as large 
as the ones that appear in equations (15.56) and (15.57), because likelihood 
ratio tests (Section 10.5) involve a factor of two, while tests based on GMM 
criterion functions (Section 9.4) do not involve such a factor. 


15.5 Nonparametric Estimation 


Estimation by nonparametric methods has become an area of major interest 
in both statistics and econometrics over the past twenty-five years. The term 
“nonparametric” can have more than one meaning. We use it here rather 
loosely to refer to a variety of estimation techniques that do not explicitly 
involve estimating parameters. We first discuss nonparametric density esti- 
mation and then move on to discuss nonparametric regression. Nonparametric 
methods can be used to provide alternatives against which to test parametric 
models, and we briefly discuss this sort of test at the end of the section. 


We have already encountered a few nonparametric estimators. In particular, 
the HAC estimators that were introduced in Section 9.3 are explicitly non- 
parametric. Another example is the empirical distribution function, or EDF, 
which was introduced in Section 4.5. As we saw there, if a sample is drawn 
from some univariate distribution, then the EDF consistently estimates the 
cumulative distribution function, or CDF. Since resampling from residuals is 
equivalent to drawing values randomly from the EDF, as we saw in Section 4.6, 
many bootstrap methods implicitly make use of nonparametric estimates. 


The probability density function (PDF) associated with a given distribution 
is the derivative of the CDF, if the derivative exists. Since an EDF is, by 
construction, a discontinuous function, its derivative does not exist at the 
points of discontinuity. Elsewhere, an EDF is locally constant, and so, at those 
points where the derivative exists, it is zero. Thus, if we wish to estimate a 
density, we clearly cannot do so by differentiating the EDF. 


Estimation of Density Functions 


One traditional way of estimating a PDF is to form a histogram. Given a 
sample $x_t$, $t = 1,\ldots,n$, of independent realizations of a random variable $X$, 
the interval containing the $x_t$ is partitioned into a set of subintervals by a set 
of points $z_i$, $i = 1,\ldots,m$, with $z_i < z_j$ for $i < j$, where $m$ is typically much 
smaller than $n$. Like the EDF, the histogram is a locally constant function 
with discontinuities. Unlike the EDF, the histogram is discontinuous at the $z_i$, 
not the $x_t$. For arbitrary argument $x$, let $i$ be such that $z_i \le x < z_{i+1}$. Then 
the histogram is defined as 


$$\hat f(x) = \frac{1}{n}\sum_{t=1}^{n} \frac{I(z_i \le x_t < z_{i+1})}{z_{i+1} - z_i}, \qquad (15.58)$$


where, as usual, $I(\cdot)$ denotes an indicator function, and the notation $\hat f(x)$ is 
motivated by the fact that the histogram is an estimate of a density function. 
Thus the value of the histogram at $x$ is the proportion of sample points contained 
in the same bin as $x$, divided by the length of the bin, that is, the length of 
the segment $[z_i, z_{i+1}]$. It is thus quite precisely the density of sample points 
in that segment. 
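

A minimal sketch of the histogram estimator follows. It assumes that the 
count in each bin is scaled by $n$ times the bin length, so that the estimate 
integrates to one; the function name, the sample, and the partitioning points 
are illustrative only. 

```python
import bisect

def histogram_density(x, sample, edges):
    """Histogram estimate of the density at x.

    `edges` is the sorted list of partitioning points z_1 < ... < z_m.
    Returns 0.0 if x lies outside [z_1, z_m).
    """
    i = bisect.bisect_right(edges, x) - 1  # index with z_i <= x < z_{i+1}
    if i < 0 or i >= len(edges) - 1:
        return 0.0
    lo, hi = edges[i], edges[i + 1]
    count = sum(1 for xt in sample if lo <= xt < hi)
    return count / (len(sample) * (hi - lo))

sample = [0.1, 0.4, 0.45, 0.7, 1.2, 1.3, 1.9, 2.5]
edges = [0.0, 1.0, 2.0, 3.0]
print(histogram_density(0.5, sample, edges))  # 0.5
```

Four of the eight observations fall in the first bin, which has length 1, so 
the estimate is 0.5 everywhere in that bin. 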


The histogram (15.58) is entirely dependent on the choice of the partitioning 
points $z_i$. If there were only one segment, $[z_1, z_2]$, covering the whole range 
of the sample, then the histogram would be constant over that range, and 
the estimated density would therefore correspond to a uniform distribution. 
If the partition were exceedingly fine, with a value of $m$ much greater than 
the sample size, then most bins would be empty, and the histogram would 
be equal to 0 for values of $x$ in those bins. For the bins that contained one 
or more points, the value of the histogram would be very large, since the 
denominator $z_{i+1} - z_i$ would tend to zero as the partition became finer. 


In the limit with just one bin, the histogram is completely smooth, being con- 
stant over the sample range. In the other limit of an infinite number of bins, 
the histogram is completely unsmooth, its values alternating between zero 
and infinity. Neither limit is at all useful. What we seek is some intermedi- 
ate degree of smoothness. More sophisticated methods of density estimation, 
which we introduce in the next subsection, must, like the histogram, make a 
choice of how smooth the estimated density should be. The choice depends 
on what is called the bandwidth, or window width, which corresponds to the 
width of a typical segment for a histogram. 


Kernel Estimation 


The empirical distribution function, or EDF, of a sample was first defined in 
Section 4.5. The definition, which is repeated here for convenience, is 


$$\hat F(x) = \frac{1}{n}\sum_{t=1}^{n} I(x_t \le x). \qquad (15.59)$$


The discontinuous indicator function $I(x_t \le x)$, or equivalently $I(x \ge x_t)$, can 
be interpreted as the CDF of a degenerate random variable which puts all its 
probability mass on $x_t$, and the EDF can then be thought of as the unweighted 
average of these CDFs. As is clear from Figure 4.6, such a discontinuous EDF 
can, when graphed, provide the appearance of a smooth approximation to a 
CDF when the sample size is moderately large. But the interpretation of the 
indicator functions as CDFs suggests that we can obtain a genuinely smooth 
estimate of the CDF by replacing the discontinuous function $I(x \ge x_t)$ by a 
continuous CDF, with support in an interval containing $x_t$. 


Let $K(x)$ be any continuous CDF corresponding to a distribution with mean 0. 
This function is called a cumulative kernel. It usually corresponds to a 
distribution that is symmetric around the origin, such as the standard normal. 
Then a smooth estimate of the CDF could be obtained by replacing the 
indicator function $I(x_t \le x)$ in equation (15.59) by $K(x - x_t)$. It is convenient 
to be able to control the degree of smoothness of the estimate. Accordingly, 
we set the variance of the distribution characterized by $K(x)$ equal to 1 and 
introduce the bandwidth parameter $h$ as a scaling parameter for the actual 
smoothing distribution. This gives the kernel CDF estimator 


$$\hat F_h(x) = \frac{1}{n}\sum_{t=1}^{n} K\Bigl(\frac{x - x_t}{h}\Bigr). \qquad (15.60)$$


Evidently, this estimator depends on the choice of the cumulative kernel and 
on the bandwidth. As $h$ tends to zero, it is easy to see that a typical term 
of the summation on the right-hand side tends to $I(x \ge x_t)$, and so $\hat F_h(x)$ 
tends to the EDF $\hat F(x)$ as $h \to 0$. At the other extreme, as $h$ becomes large, a 
typical term of the summation tends to the constant value $K(0)$, which makes 
the kernel estimator $\hat F_h(x)$ very much too smooth. In the usual case in which 
$K(x)$ is symmetric, $\hat F_h(x)$ tends to 0.5 as $h \to \infty$. 
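

The two limiting cases can be checked numerically. The sketch below takes 
the standard normal CDF as the cumulative kernel, which is one admissible 
choice among many, and compares the kernel CDF estimator with the EDF 
for a very small and a very large bandwidth. The sample and the function 
names are illustrative. 

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF, used here as the cumulative kernel K."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def edf(x, sample):
    """Empirical distribution function (15.59)."""
    return sum(1 for xt in sample if xt <= x) / len(sample)

def kernel_cdf(x, sample, h):
    """Kernel CDF estimator (15.60) with a Gaussian cumulative kernel."""
    return sum(std_normal_cdf((x - xt) / h) for xt in sample) / len(sample)

sample = [-1.3, -0.2, 0.1, 0.8, 2.0]
x = 0.5
print(edf(x, sample))               # 0.6
print(kernel_cdf(x, sample, 1e-6))  # very small h: approximately the EDF
print(kernel_cdf(x, sample, 1e6))   # very large h: approximately 0.5
```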


Kernel methods can also be used for density estimation. In fact, they are 
much more commonly used to estimate PDFs than to estimate CDFs. For 
density estimation, we choose a function $K(x)$ that is not only continuous but 
also differentiable and define the kernel function $k(x)$, often simply called the 
kernel, as $K'(x)$. Then, if we differentiate equation (15.60) with respect to $x$, 
we obtain the kernel density estimator 


$$\hat f_h(x) = \frac{1}{nh}\sum_{t=1}^{n} k\Bigl(\frac{x - x_t}{h}\Bigr). \qquad (15.61)$$


Like the kernel CDF estimator (15.60), the kernel density estimator depends 
on both the choice of kernel $k$ and the bandwidth $h$. It turns out that the 
choice of kernel is much less critical than the choice of bandwidth. One very 
popular choice for $k$ is the Gaussian kernel, which is just the standard 
normal density $\phi$. It gives a positive (although perhaps very small) weight to 
every point in the sample. Another commonly used kernel, which has certain 
optimality properties, is the Epanechnikov kernel, 


$$k(z) = \frac{3(1 - z^2/5)}{4\sqrt{5}} \ \text{ for } |z| < \sqrt{5}, \quad 0 \text{ otherwise}.$$


This kernel gives a positive weight only to points for which $|x_t - x|/h < \sqrt{5}$. 
In practice, the Gaussian and Epanechnikov kernels generally give very similar 
estimates if they are based on similar values of $h$. 
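

The kernel density estimator with either of these two kernels can be sketched 
in a few lines. The function names and the small sample are illustrative 
assumptions; the Epanechnikov kernel is written in the variance-one form, so 
that its support is $|z| < \sqrt{5}$. 

```python
import math

SQRT5 = math.sqrt(5.0)

def gaussian_kernel(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(z):
    """Epanechnikov kernel scaled to have variance 1 (support |z| < sqrt(5))."""
    return 3.0 * (1.0 - z * z / 5.0) / (4.0 * SQRT5) if abs(z) < SQRT5 else 0.0

def kernel_density(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimator: (1/(n h)) * sum over t of k((x - x_t)/h)."""
    return sum(kernel((x - xt) / h) for xt in sample) / (len(sample) * h)

sample = [-1.3, -0.2, 0.1, 0.8, 2.0]
f_gauss = kernel_density(0.0, sample, h=0.5)
f_epan = kernel_density(0.0, sample, h=0.5, kernel=epanechnikov_kernel)
```

Because each kernel integrates to one, so does the estimated density, 
whatever the bandwidth. 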


Choosing the Bandwidth 


The kernel density estimator (15.61) is sensitive to the value of the bandwidth 
parameter $h$, and there is a very large and highly technical literature on how 
best to choose it. See Silverman (1986), Härdle (1990), Wand and Jones 
(1995), or Pagan and Ullah (1999) for introductions to this literature. The 
estimator $\hat f_h(x)$ is biased, unless the density is genuinely constant, which is 
almost never the case, and too large a value of $h$ gives rise to oversmoothing. 
This suggests that, to make bias small, $h$ should be small. However, when $h$ is 
too small, the estimator suffers from undersmoothing, which implies that the 
variance of $\hat f_h(x)$ is large. Thus any choice of $h$ inevitably involves a tradeoff 
between the bias and the variance. This suggests that we should choose $h$ to 
minimize the expectation of the squared error, defined as 


$$E\bigl(\hat f_h(x) - f(x)\bigr)^2 = \bigl(E\hat f_h(x) - f(x)\bigr)^2 + \mathrm{Var}\bigl(\hat f_h(x)\bigr), \qquad (15.62)$$


that is, the square of the bias of $\hat f_h(x)$ plus its variance. If we are interested 
in the entire density rather than just the density at a single point, which is 
often but not always the case, then we would like to minimize the integral 
over all $x$ of either side of equation (15.62). 


Under fairly general regularity conditions, it can be shown that any $h$ 
that minimizes the expectation (15.62) or its integral must be proportional 
to $n^{-1/5}$; see Exercise 15.26. The factor of proportionality depends on the 
true distribution of the data. Two popular choices for $h$ are 


$$h = 1.059\, s\, n^{-1/5}, \quad\text{and} \qquad (15.63)$$
$$h = 0.785\,(\hat q_{.75} - \hat q_{.25})\, n^{-1/5}, \qquad (15.64)$$


where $s$ is the standard deviation of the $x_t$, and $\hat q_{.75} - \hat q_{.25}$ is the difference 
between the estimated .75 and .25 quantiles of the data, which is known as 
the interquartile range, or IQR. When the data are approximately normally 
distributed, it makes sense to use $s$ to measure the spread of the data. In 
fact, the value of $h$ given in equation (15.63) is optimal for data that are 
normally distributed when using a Gaussian kernel. The factor of 1.059 is 
really $(4/3)^{1/5}$, a quantity that appears in the proof of optimality. When the 
data have thick tails, $s$ tends to overestimate the spread, and it is better to 
use the interquartile range. Note that the factor of 0.785 in equation (15.64) 
is 1.059 divided by 1.349, which is the interquartile range for the standard 
normal distribution. Thus, if the data were normally distributed, both $s$ and 
IQR/1.349 would be estimates of $\sigma$. 


Although the values given in equations (15.63) and (15.64) should work quite 
well in many cases, they may tend to oversmooth a bit when the data are 
strongly skewed or bimodal. Therefore, as a rule of thumb, Silverman (1986) 
suggests using 


$$h = 0.9 \min(s, \mathrm{IQR}/1.349)\, n^{-1/5}. \qquad (15.65)$$


This is the minimum of the values defined in equations (15.63) and (15.64), 
but with the factor of 1.059 replaced by 0.9 in order to reduce the risk of 
oversmoothing. 
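

The three bandwidth rules are simple enough to compute side by side. In the 
sketch below, the function name and dictionary keys are our own, and the 
quartiles are estimated with the default method of Python's 
`statistics.quantiles`, which is only one of several common quantile 
estimators and may differ slightly from the one used for the figures in the text. 

```python
import statistics

def bandwidths(sample):
    """Return the three rule-of-thumb bandwidths (15.63), (15.64), and (15.65)."""
    n = len(sample)
    s = statistics.stdev(sample)                    # sample standard deviation
    q1, _, q3 = statistics.quantiles(sample, n=4)   # estimated quartiles
    iqr = q3 - q1
    scale = n ** (-1.0 / 5.0)
    return {
        "normal_reference": 1.059 * s * scale,            # (15.63)
        "iqr_based": 0.785 * iqr * scale,                 # (15.64)
        "silverman": 0.9 * min(s, iqr / 1.349) * scale,   # (15.65)
    }

sample = [2.1, -0.4, 0.3, 1.7, -1.2, 0.9, 0.0, -0.8, 1.1, 0.5]
h = bandwidths(sample)
```

By construction, the rule-of-thumb value is never larger than either of the 
other two. 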


It should be noted that the bandwidths appropriate for kernel estimation of 
densities are not appropriate for kernel estimation of CDFs. Under the same 
(extremely strong) assumptions that led to the value $h = 1.059\,s\,n^{-1/5}$ for 
density estimation, it can be shown that $h = 1.587\,s\,n^{-1/3}$ is optimal for CDF 
estimation. However, $h = 1.3\,s\,n^{-1/3}$ may be a better choice if interest centers 
on tail quantiles. See Azzalini (1981) or Wand and Jones (1995). 


An Illustration of Kernel Density Estimation 


Figure 15.1 shows an estimated density for daily percentage returns on IBM 
common stock. It is based on 9939 observations from July, 1962 to December, 
2001. A Gaussian kernel with h given by equation (15.64) was used. We 
also tried using the somewhat larger value of h given by equation (15.63), 
the somewhat smaller value (Silverman’s rule of thumb) given by equation 
(15.65), and an Epanechnikov instead of a Gaussian kernel. The alternative 
density estimates were so close to the one shown in the figure that it is not 
worth plotting them, although the peak was very slightly higher when we used 
Silverman’s rule of thumb. Note that we did not estimate $\hat f_h(x)$ for every point 
in the sample, but only for 201 evenly-spaced points between −10 and 10. It 
makes sense to do something like this when the objective is simply to plot an 
estimated density and the sample size is large. 


Figure 15.1 also shows a normal density with the same mean and variance as 
the data. This normal density looks very different from the kernel estimate. 
As we noted in Section 15.2, returns data from financial markets commonly 
display excess kurtosis. The kernel density estimates strongly suggest that this 
empirical regularity holds for the IBM stock returns, since the density appears 
to be much more peaked and to have thicker tails than the normal. The thicker 
tails are hard to see in the figure, but they are evident if one looks closely 
at $\hat f_h(x)$ for larger absolute values of $x$. For example, $\hat f_h(8) = 0.0006051$ and 
$\hat f_h(-8) = 0.0005470$, while the corresponding values for the normal density 
are just 0.0000014 and 0.0000010. 


Figure 15.1 Estimates of the density of IBM stock returns (curves: kernel 
estimate, normal approximation) 


Figure 15.2 shows the same estimated density as Figure 15.1, plus two others. 
Both of the new estimates used a Gaussian kernel, but with $h$ either four 
times larger or four times smaller than the value given in equation (15.64). 
When $h$ is too large, the estimated density is very smooth, but it is much 
less peaked than the one based on a sensible choice of $h$. Thus it is evident 
that, for many values of $x$, these oversmoothed estimates are severely biased. 
In contrast, when $h$ is too small, the estimated density has roughly the same 
shape as before, but it is extremely jagged. Thus it is evident that the variance 
of these undersmoothed estimates is quite high. 


Figure 15.2 Effect of $h$ on kernel density estimates (curves: 
$h = 0.1975\,n^{-1/5}\,\mathrm{IQR}$, $h = 0.79\,n^{-1/5}\,\mathrm{IQR}$, $h = 3.16\,n^{-1/5}\,\mathrm{IQR}$) 


The example in Figures 15.1 and 15.2 involves 9939 observations, which is a 
rather large number. Kernel density estimation cannot be expected to work 
nearly as well when the sample size is small, because there are not enough 
observations in the vicinity of many values of x to obtain good estimates of 
f(x) for those values. 


Nonparametric Regression 


The fitted values from a regression model are estimates of the expectation of 
the dependent variable conditional on the values of the explanatory variables 
for each observation. The linear regression model (1.01) is perhaps the 
simplest such model, since it makes use of only one explanatory variable, $x_t$, and 
the expectation of the dependent variable $y_t$ conditional on $x_t$ is assumed to 
be an affine function of $x_t$. This very strong assumption can of course be 
relaxed by using powers or other nonlinear transformations of $x_t$ as additional 
regressors. An alternative approach is to use a nonparametric regression, 
which estimates $E(y_t \mid x_t)$ directly, without making any assumptions about 
functional form. 


The simplest approach to nonparametric regression is kernel regression, a 
technique similar to kernel density estimation. We suppose that two random 
variables $Y$ and $X$ are jointly distributed, and we wish to estimate the 
conditional expectation $\mu(x) = E(Y \mid x)$ as a function of $x$, using a sample of paired 
observations $(y_t, x_t)$ for $t = 1,\ldots,n$. For given $x$, consider the function $G(x)$ 
defined as 


$$G(x) = E\bigl(Y \cdot I(X \le x)\bigr) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} y\, f(y, z)\, dy\, dz,$$


where $f(y, x)$ is the joint density of $Y$ and $X$. Let $g(x) = G'(x)$ denote the 
first derivative of $G(x)$. Then 


$$g(x) = \int_{-\infty}^{\infty} y\, f(y, x)\, dy = f(x)\int_{-\infty}^{\infty} y\, f(y \mid x)\, dy = f(x)\,E(Y \mid x),$$


where $f(x)$ is the marginal density of $X$, and $f(y \mid x)$ is the density of $Y$ 
conditional on $X = x$. 


A natural unbiased estimator of $G(x)$ is $n^{-1}\sum_{t=1}^{n} y_t I(x_t \le x)$, but this, like 
the EDF, is discontinuous and cannot be differentiated. As with the kernel 
CDF estimator (15.60), therefore, we replace this estimator by the biased but 
smooth estimator 


$$\hat G_h(x) = \frac{1}{n}\sum_{t=1}^{n} y_t K\Bigl(\frac{x - x_t}{h}\Bigr), \qquad (15.66)$$


where $K$ is a cumulative kernel (that is, the CDF of a distribution with 
mean 0 and variance 1), and $h$ is a bandwidth parameter. Defining $\hat g_h(x)$ as 
the derivative of (15.66) and using the kernel estimator (15.61) to estimate 
the marginal density of $X$ leads to the following estimator of $\mu(x)$: 


$$\hat\mu_h(x) = \frac{\hat g_h(x)}{\hat f_h(x)}.$$


This is called the Nadaraya-Watson estimator, and it simplifies to 


$$\hat\mu_h(x) = \frac{\sum_{t=1}^{n} y_t k_t}{\sum_{t=1}^{n} k_t}, \qquad k_t \equiv k\Bigl(\frac{x - x_t}{h}\Bigr), \qquad (15.67)$$


where $k = K'$ is a kernel function. 
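

Since the Nadaraya-Watson estimator is just a kernel-weighted average of 
the $y_t$, it takes only a few lines to compute. The sketch below assumes a 
Gaussian kernel; the function names and data are illustrative. 

```python
import math

def gaussian_kernel(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at x: a kernel-weighted average of the y_t."""
    weights = [kernel((x - xt) / h) for xt in xs]
    return sum(w * yt for w, yt in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
mu_hat = nadaraya_watson(2.0, xs, ys, h=0.8)
```

As $h$ becomes very large, all the weights become equal and the estimate tends 
to the sample mean of the $y_t$. With a kernel of bounded support, or with a 
very small $h$, the weights may all be zero at points far from the data, in 
which case the ratio is undefined. 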


The Nadaraya-Watson estimator is the solution to the estimating equation 


$$\sum_{t=1}^{n} k_t\bigl(y_t - \hat\mu_h(x)\bigr) = 0.$$


This can be thought of as the empirical counterpart of a weighted average of 
the elementary zero functions $y_t - \mu(x)$. But these elementary zero functions 
do not have mean 0, because the conditional expectation of $y_t$ is not $\mu(x)$ but 
$\mu(x_t)$. This is evidently a source of bias. The correct, but infeasible, zero 
function would instead be $y_t - \mu(x_t)$. 


A better approximation to the correct zero function is given by the two-term
Taylor expansion $\mu(x) + \mu'(x)(x_t - x)$, in which both $\mu(x)$ and $\mu'(x)$
are unknown. Both of these unknowns can be estimated simultaneously by
solving the two estimating equations


$$\sum_{t=1}^{n} k_t\bigl(y_t - \mu(x) - \mu'(x)(x_t - x)\bigr) = 0, \quad\text{and}$$
$$\sum_{t=1}^{n} k_t\,(x_t - x)\bigl(y_t - \mu(x) - \mu'(x)(x_t - x)\bigr) = 0. \qquad (15.68)$$


These estimating equations are at least approximately correct, because the
random variable $Y - E(Y \mid X)$, of which the $y_t - \mu(x) - \mu'(x)(x_t - x)$ are
approximate realizations, is uncorrelated with $X - x$. The simplest way to
solve equations (15.68) is to run the linear regression


$$k_t^{1/2} y_t = \mu(x)\,k_t^{1/2} + \mu'(x)\,k_t^{1/2}(x_t - x) + \text{residual}, \qquad (15.69)$$
so as to obtain the locally linear estimator of $\mu(x)$, which is just the first
estimated coefficient. Regression (15.69) is to be run for every value of x for
which we wish to estimate $E(Y \mid x)$. Taylor expansions with more terms can
be handled simply by adding additional regressors of the form $k_t^{1/2}(x_t - x)^i$
to (15.69), with i an integer greater than 1. See Fan and Gijbels (1996) for a
detailed discussion of the numerous other methods of kernel regression. 
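
The two estimators just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, NumPy is assumed to be available, and a Gaussian kernel (which has mean 0 and variance 1) is used in place of whatever kernel a particular application calls for. The Nadaraya-Watson function implements the weighted average (15.67), and the locally linear function runs regression (15.69) at a single point.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard normal density: a kernel with mean 0 and variance 1 (an assumed choice)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x0, x, y, h, kernel=gaussian_kernel):
    """Estimate E(Y | x0) as in (15.67): a kernel-weighted average of the y_t."""
    k = kernel((x0 - x) / h)
    return np.sum(k * y) / np.sum(k)

def locally_linear(x0, x, y, h, kernel=gaussian_kernel):
    """Run regression (15.69) at x0 and return the estimates of mu(x0) and mu'(x0)."""
    root_k = np.sqrt(kernel((x0 - x) / h))
    # Regressors: k_t^{1/2} and k_t^{1/2} (x_t - x0); regressand: k_t^{1/2} y_t.
    regressors = np.column_stack([root_k, root_k * (x - x0)])
    coef, *_ = np.linalg.lstsq(regressors, root_k * y, rcond=None)
    return coef[0], coef[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-2.0, 2.0, size=400)
    y = np.sin(2.0 * x) + 0.3 * rng.standard_normal(400)
    for x0 in (-2.0, 0.0, 2.0):
        print(x0, nadaraya_watson(x0, x, y, 0.3), locally_linear(x0, x, y, 0.3)[0])
```

Because the locally linear estimator fits a line in each neighborhood, it reproduces an exactly linear regression function without error, which the Nadaraya-Watson estimator generally does not do near the edges of the data.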


Up to this point, we have assumed that there is just one regressor. Although 
kernel regression can be used when there are several regressors, its perfor- 
mance tends to become much worse as the number of regressors k increases. 
Intuitively, the reason for which it suffers from this curse of dimensionality 
is that the fraction of sample points that are close to any point at which we 
wish to evaluate the conditional expectation declines rapidly as k increases. 
In consequence, when k is greater than 1 or 2, we generally need a very large 
sample for estimates to be at all precise. 


As an example, suppose that the regressors follow independent N(0,1) distributions,
the point x is the origin, and we define “close” to mean that the
Euclidean distance between $x_t$ and the origin is no greater than 0.5. As the
notation indicates, x and $x_t$ are now, in general, vectors. When k = 1, the
fraction of the $x_t$ that are close to the origin in this sense is 0.383; when
k = 2, it is 0.118; when k = 3, it is 0.031; and so on. See Exercise 15.29. This
example is typical. In general, the proportion of the sample points for which
$x_t$ is close to any specified x decreases rapidly as k increases.
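
The fractions quoted in this example can be checked directly. The sketch below is an illustration only (the function name is hypothetical). It uses the fact that the squared Euclidean norm of k independent N(0,1) variables follows the chi-squared distribution with k degrees of freedom, together with the standard closed forms of that distribution's CDF for k = 1, 2, and 3.

```python
import math

def fraction_within(radius, k):
    """P(||x|| <= radius) for x ~ N(0, I_k), using closed forms for k = 1, 2, 3."""
    r = radius
    if k == 1:
        return math.erf(r / math.sqrt(2.0))
    if k == 2:
        return 1.0 - math.exp(-0.5 * r * r)
    if k == 3:
        return (math.erf(r / math.sqrt(2.0))
                - math.sqrt(2.0 / math.pi) * r * math.exp(-0.5 * r * r))
    raise ValueError("closed forms are coded only for k = 1, 2, 3")

for k in (1, 2, 3):
    print(k, round(fraction_within(0.5, k), 3))
```

Rounded to three decimal places, the three values agree with the figures 0.383, 0.118, and 0.031 given above.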


Cross-Validation 


The optimal choice of h for kernel regression is not, in general, the same 
as for kernel density estimation. When there are k regressors, h should be 
proportional to $n^{-1/(k+4)}$. The optimal choice of h depends on a number
of things, including the values of x for which we wish to compute $E(Y \mid x)$,
the kernel, and the shape of the true (but unknown) regression function.
Consequently, there is no widely used rule of thumb for choosing h like the
ones we discussed for kernel density estimation. Instead, it is customary to
choose h by some sort of data-based method. One popular approach is to use
a technique called cross-validation, which we now discuss in the context of
kernel regression.³


Suppose we choose a bandwidth h and calculate a kernel estimate $\hat\mu_h(x_t)$ for
each value of $x_t$ in the sample. This may be a Nadaraya-Watson estimate, a
locally linear estimate, or some other type of kernel regression estimate. In
order to compute $\hat\mu_h(x_t)$, we make use of the values $k((x_t - x_s)/h)$ of the
kernel function for all s = 1,...,n. As $h \to 0$, these values tend to 0 for all s
such that $x_s \ne x_t$. If the only s for which $x_s = x_t$ is t itself, it follows that
$\hat\mu_h(x_t)$ tends to $y_t$ as $h \to 0$. In the event of ties, $\hat\mu_h(x_t)$ tends to the
average of the $y_s$ for which s is such that $x_s = x_t$. The residual $\hat\mu_h(x_t) - y_t$
thus tends to 0 in the former case, and to the deviation of $y_t$ from the mean of
the $y_s$ with tied values of $x_s$ in the latter case.


3 Cross-validation can also be used to choose the bandwidth for kernel density
estimation; see Silverman (1986).


This rules out the sum of the squares of the $\hat\mu_h(x_t) - y_t$ as a useful criterion
function for the choice of h, since it tends monotonically to a lower limit as
$h \to 0$. Instead, it is common to use what is called a leave-one-out estimator.
We encountered such estimators in Section 2.6, in connection with leverage.
Here, the estimator has the same form as a regular kernel estimator, except
that observation t is omitted when we estimate $E(Y \mid x_t)$. Thus there is no
tendency for the leave-one-out estimate $\hat\mu_h^{(t)}(x_t)$ to converge to $y_t$ as $h \to 0$.
Otherwise, the leave-one-out estimator has exactly the same properties as the
ordinary kernel estimator when x is not one of the sample points.


We define the cross-validation function by the formula 


$$\mathrm{CV}(h) = \sum_{t=1}^{n} w(x_t)\bigl(y_t - \hat\mu_h^{(t)}(x_t)\bigr)^2. \qquad (15.70)$$


Here $w(x_t)$ is a weight, which could just be 1 for all observations. When we
use cross-validation, we evaluate CV(h) for a number of values of h and pick
the value that minimizes it. This makes sense, because, when the weights
are chosen appropriately, the cross-validation function (15.70) provides a reasonable
way to estimate the average squared error of $\hat\mu_h(x)$ over the range of
values of x in which we are interested. It is attractive to use nonconstant
weights if we are more interested in obtaining good estimates of $E(Y \mid x)$ for
some values of x than for others. The weight might even be 0 for values of $x_t$
that are far from the values of x in which we are interested.
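
The cross-validation function (15.70) is straightforward to compute for the Nadaraya-Watson estimator. The following is a minimal sketch, not a definitive implementation: the function names, the simulated data, the Gaussian kernel, and the grid of bandwidths are all assumptions made for illustration. Setting the diagonal of the kernel matrix to zero is what omits observation t from its own estimate.

```python
import numpy as np

def gaussian_kernel(z):
    """Standard normal density, used here as an assumed kernel."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def cross_validation(h, x, y, weights=None, kernel=gaussian_kernel):
    """CV(h) as in (15.70) for the Nadaraya-Watson estimator."""
    k = kernel((x[:, None] - x[None, :]) / h)   # k[t, s] = k((x_t - x_s)/h)
    np.fill_diagonal(k, 0.0)                    # leave observation t out
    mu_loo = (k @ y) / k.sum(axis=1)            # leave-one-out estimates
    w = np.ones_like(y) if weights is None else weights
    return float(np.sum(w * (y - mu_loo) ** 2))

def in_sample_ssr(h, x, y, kernel=gaussian_kernel):
    """Sum of squared residuals when observation t is NOT left out."""
    k = kernel((x[:, None] - x[None, :]) / h)
    mu = (k @ y) / k.sum(axis=1)
    return float(np.sum((y - mu) ** 2))

# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 4.0, size=200)
y = np.sin(2.0 * x) + 0.5 * rng.standard_normal(200)

grid = np.array([0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6])
cv = np.array([cross_validation(h, x, y) for h in grid])
h_cv = grid[np.argmin(cv)]
print("h chosen by cross-validation:", h_cv)
```

Comparing `in_sample_ssr` across the grid shows why the ordinary sum of squared residuals cannot be used as a criterion: it keeps falling as h shrinks, whereas the leave-one-out criterion does not.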


A Numerical Example 


It is instructive to see how kernel regression works in practice. For purposes 
of illustration, we generated 400 observations from an artificial DGP that was 
linear for $x_t$ below a certain value and quite nonlinear beyond that point. It
is obvious that a linear regression model fits these data very badly. 


Figure 15.3 shows the data and two sets of Nadaraya-Watson kernel estimates 
based on an Epanechnikov kernel. The first set, shown as a solid line, used 
a baseline bandwidth $h = s n^{-1/5}$, which, by analogy with results for kernel
density estimation, seems like a reasonable value to start with. Although these
estimates look sensible for most values of $x_t$, they perform poorly for extreme
values. In particular, they severely overestimate $E(Y \mid x)$ for the largest values
of x. This happens because, when x is large, there are few or no values of $x_t$
greater than x. Consequently, most of the values of $y_t$ of which $\hat\mu_h(x)$ is a
weighted average are associated with values of $x_t$ smaller than x.


The second set of estimates shown in Figure 15.3 used a value of h chosen by
cross-validation. The only way to avoid severely overestimating $E(Y \mid x)$ for
the more extreme values of x is to make the bandwidth quite small, and indeed
the value of h given by cross-validation is much smaller than the baseline
value. But making h small causes $\hat\mu_h(x)$ to wiggle around much more than
seems reasonable. Thus neither set of Nadaraya-Watson estimates is at all
satisfactory.


[Figure 15.3 Nadaraya-Watson kernel regression using simulated data. Axes: $y_t$ against $x_t$. Legend: baseline $h = s n^{-1/5} = 0.9803$; cross-validation $h = 0.2498$.]


Figure 15.4 shows the same data and two sets of locally linear kernel estimates,
both also based on an Epanechnikov kernel. These estimates are much more
plausible than the previous ones. The h chosen by cross-validation is smaller
than the baseline value, but it is large enough that $\hat\mu_h(x)$ is generally quite
smooth, and there is not much difference between the two sets of estimates.
The values of the cross-validation function (15.70) provide further evidence
that the locally linear estimates are better than the Nadaraya-Watson ones.
For the baseline value of h, these values are 2.4947 and 2.8525, respectively.
For the optimal values of h, they are 2.4693 and 2.4960.


[Figure 15.4 Locally linear kernel regression using simulated data. Axes: $y_t$ against $x_t$. Legend: baseline $h = s n^{-1/5} = 0.9803$; cross-validation $h = 0.6920$.]


Our treatment of nonparametric estimation is necessarily very superficial. 
Much more detailed discussions may be found in Härdle (1990), Yatchew
(1998), and Pagan and Ullah (1999). There is a vast literature on techniques 
for nonparametric regression. In addition to the references already cited, see 
Green and Silverman (1994), Simonoff (1996), Eubank (1999), and Loader 
(1999), among others. 


Assessing the Specification of Parametric Models 


Nonparametric methods can be useful even when we are primarily interested in 
estimating a parametric model. Graphical methods can be especially valuable. 
Looking at the fitted values from a kernel regression may suggest what sort 
of parametric nonlinear regression function could be expected to provide a 
good fit. Graphing the fitted values from a parametric model alongside those 
from a kernel regression may indicate in what respects, if any, the parametric 
model needs improvement. 


More formal methods exist for testing the validity of a parametric regression 
model using the evidence provided by a nonparametric one. In fact, many 
testing procedures have been proposed. Some of these are explicitly based on 
the J and P tests that were discussed in Section 15.2. Examples include the 
tests proposed by Wooldridge (1992) and Delgado and Stengos (1994), neither 
of which uses kernel regression to estimate the nonparametric model. These 
tests require that the parametric null and nonparametric alternative models 
should be nonnested. Many other tests do not have this requirement, and 
so they can be used to test the null hypothesis that E(Y | X) has a specific, 
parametric, functional form. Examples include Zheng (1996), Li and Wang 
(1998), Ellison and Ellison (2000), and Horowitz and Spokoiny (2001), all 
of which use some variant of kernel regression. Tests that do not use kernel 
regression have been proposed by Yatchew (1992) and Hong and White (1995), 
among others. 


Although most of these tests are conceptually simple, and some of them are 
also simple to compute, their asymptotic validity generally depends on tech- 
nical assumptions that may be difficult to verify. Moreover, their finite-sample 
performance, as asymptotic tests, is often not very good. For both these 
reasons, it is highly desirable to bootstrap them. Even if a test statistic is not 
asymptotically pivotal, bootstrap P values are almost always asymptotically 
valid. If a test statistic is asymptotically pivotal, as is the case for the tests 
proposed in all of the papers cited above when the required conditions hold,
then bootstrap P values should be more accurate than asymptotic P values,
in the sense that we discussed at the end of Section 4.6. 


It is quite easy to bootstrap this sort of test statistic. For example, consider
the statistic

$$\frac{(y - \hat x)^\top (y - \hat x) - (y - \hat g)^\top (y - \hat g)}{(y - \hat x)^\top (y - \hat x)/(n - k)}, \qquad (15.71)$$

which is closely related to the statistics proposed by Yatchew (1992) and Hong
and White (1995). In expression (15.71), y is the vector of observations on
the dependent variable, $\hat x$ is the vector of fitted values from the parametric
model, $\hat g$ is the vector of fitted values from the nonparametric model, and k is
the number of parameters. When the nonparametric model involves the same
regressors as the parametric one, we would expect this statistic to be positive
if the bandwidth for the nonparametric regression has been chosen sensibly.
Thus we want to reject the null hypothesis whenever expression (15.71) is
positive and sufficiently large.


When the null hypothesis is a static regression model, it is natural to specify
the bootstrap DGP as

$$y^* = \hat x + u^*,$$

where $u^*$ is an n-vector with typical element $u_t^*$ obtained by resampling the
rescaled residuals from the parametric model. When the null hypothesis is a
dynamic model, the vector $y^*$ should be generated recursively, as in equation
(4.65). For each bootstrap sample, we estimate both the parametric model 
and the nonparametric one and then use the two sets of estimates to compute 
the test statistic (15.71). A bootstrap P value is then computed in the usual 
way as the proportion of the bootstrap statistics that are larger than the 
actual test statistic; see equation (4.61). 
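
The bootstrap procedure just described can be sketched as follows for the static case. Several things here are assumptions made purely for illustration: the parametric null is a linear regression estimated by OLS, the nonparametric alternative is a Nadaraya-Watson fit with a Gaussian kernel and a fixed bandwidth (so it is not re-chosen by cross-validation in each bootstrap sample), residuals are rescaled by the factor $(n/(n-k))^{1/2}$, and all function names and simulated data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Fitted values from an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def nw_smoother(x, h):
    """n x n matrix S such that S @ y gives Nadaraya-Watson fitted values."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def statistic(y, X, S):
    """Expression (15.71): both models are re-estimated on the data passed in."""
    n, k = X.shape
    ssr_par = np.sum((y - ols_fit(X, y)) ** 2)
    ssr_np = np.sum((y - S @ y) ** 2)
    return (ssr_par - ssr_np) / (ssr_par / (n - k))

def bootstrap_p_value(y, X, S, B=199, seed=0):
    """Resample rescaled parametric residuals and recompute (15.71) B times."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    tau_hat = statistic(y, X, S)
    fitted = ols_fit(X, y)
    resid = (y - fitted) * np.sqrt(n / (n - k))   # rescaled residuals
    tau_star = np.empty(B)
    for j in range(B):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        tau_star[j] = statistic(y_star, X, S)
    # Proportion of bootstrap statistics larger than the actual statistic.
    return tau_hat, float(np.mean(tau_star > tau_hat))

# Hypothetical data in which the linear null is false.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
X = np.column_stack([np.ones(200), x])
y = 1.0 + 0.5 * x + 2.0 * np.sin(3.0 * x) + 0.5 * rng.standard_normal(200)
S = nw_smoother(x, 0.3)
tau_hat, p_value = bootstrap_p_value(y, X, S)
print(tau_hat, p_value)
```

Because the regressor is held fixed across bootstrap samples, the smoother matrix can be computed once. If the bandwidth were chosen by cross-validation within each bootstrap sample, that shortcut would no longer apply, which is one source of the computational cost discussed below.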


Although this procedure is conceptually straightforward, it can be compu- 
tationally costly. The cost is high whenever the parametric model involves 
nonlinear estimation and/or the bandwidth for the nonparametric model is 
chosen by cross-validation. In general, we recommend using simpler proce- 
dures, such as the RESET test, F tests for omitted powers and cross-products 
of the regressors, and nonnested hypothesis tests, prior to explicitly testing 
a parametric model against one or more nonparametric alternatives. Doing 
the latter makes sense only if the simpler procedures fail to find conclusive 
evidence of misspecification. 


15.6 Final Remarks 


It is difficult to overemphasize the importance of testing the specification of 
econometric models thoroughly before using them for any purpose. For this 
reason, several procedures for model specification testing have been discussed 
in this chapter and elsewhere in the book. Many of these procedures are based
on artificial regressions, because they require estimation of the model under
test only, and artificial regressions are generally quite easy to set up. 


We must use caution when performing more than one specification test on 
the same model. Even when every one of the tests is exact, which is very 
rarely true in practice, the probability that at least one test rejects a correctly 
specified model by chance can be quite large; see Exercise 15.30. As readers 
are asked to show in that exercise, it is easy to control the overall significance 
level of several tests taken together if all the test statistics are independent. 
However, it is not at all easy to do so without sacrificing power when the test 
statistics are not independent, which they generally are not; see Savin (1980) 
for an introduction to this topic. 
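
For the case of independent, exact tests, the arithmetic can be illustrated with a short sketch. The function names are hypothetical. If each of m independent tests has level $\alpha$, the probability that at least one rejects a correctly specified model is $1 - (1 - \alpha)^m$, and inverting this gives the level each individual test needs for a desired overall level.

```python
def overall_size(alpha, m):
    """P(at least one of m independent exact level-alpha tests rejects a true null)."""
    return 1.0 - (1.0 - alpha) ** m

def per_test_level(overall_alpha, m):
    """Level for each of m independent tests so that the overall level is overall_alpha."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / m)

for m in (1, 5, 10, 20):
    print(m, round(overall_size(0.05, m), 4), round(per_test_level(0.05, m), 5))
```

This calculation does not carry over to dependent test statistics, which is the case that usually arises in practice.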


For more detailed treatments of some aspects of model specification testing 
in econometrics, see Godfrey (1988) and White (1994). The second of these 
books is very much more advanced than the discussion in this chapter. 


15.7 Appendix: Test Regressors in Artificial Regressions 


In this appendix, we sketch a proof of why the three conditions R1-R3 given 
in Section 15.2 make it admissible to base an asymptotic test on the artificial 
regression (15.05). We assume that the DGP belongs to the model M and
is characterized by the parameter vector $\theta_0$. We deal only with the case
in which the asymptotic covariance matrix of the estimator $\hat\theta$ is given by
equation (15.02). The other case, in which it is given by equation (15.03), is
dealt with in Exercise 15.4.


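The explained sums of squares from the restricted artificial regression (15.01),
evaluated at the root-n consistent estimator $\acute\theta$, and the unrestricted artificial
regression (15.05) can be written as

$$\acute r^\top P_{\acute R}\,\acute r \quad\text{and}\quad \acute r^\top P_{\acute R,\acute Z}\,\acute r,$$

respectively, where $P_{\acute R}$ is the orthogonal projection on to the columns of $\acute R$,
and $P_{\acute R,\acute Z}$ is the orthogonal projection onto the columns of $\acute R$ and $\acute Z$ jointly.
The difference between the two explained sums of squares is therefore

$$\acute r^\top (P_{\acute R,\acute Z} - P_{\acute R})\,\acute r = \acute r^\top P_{M_{\acute R}\acute Z}\,\acute r = \acute r^\top M_{\acute R}\acute Z\,(\acute Z^\top M_{\acute R}\acute Z)^{-1}\acute Z^\top M_{\acute R}\,\acute r, \qquad (15.72)$$

where $M_{\acute R} = I - P_{\acute R}$. Note that expression (15.72) could also be computed as
the difference between the sums of squared residuals from regressions (15.01)
and (15.05).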
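
The algebra behind expression (15.72) can be checked numerically. The sketch below is an illustration with randomly generated matrices standing in for the regressand and regressors of the artificial regressions; the variable names are hypothetical and nothing in it depends on a particular model. It confirms that the difference between the two explained sums of squares equals the quadratic form on the right of (15.72), and also the difference between the two sums of squared residuals.

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(0)
n, k, r = 50, 3, 2
R = rng.standard_normal((n, k))      # stands in for the regressors of (15.01)
Z = rng.standard_normal((n, r))      # stands in for the test regressors
r_vec = rng.standard_normal(n)       # stands in for the regressand

P_R = proj(R)
P_RZ = proj(np.column_stack([R, Z]))
M_R = np.eye(n) - P_R

# Difference of explained sums of squares, as on the left of (15.72).
lhs = r_vec @ (P_RZ - P_R) @ r_vec
# Quadratic form on the right of (15.72).
rhs = r_vec @ M_R @ Z @ np.linalg.solve(Z.T @ M_R @ Z, Z.T @ M_R @ r_vec)
# Difference of sums of squared residuals from the two regressions.
ssr_restricted = r_vec @ M_R @ r_vec
ssr_unrestricted = r_vec @ (np.eye(n) - P_RZ) @ r_vec

print(np.isclose(lhs, rhs), np.isclose(lhs, ssr_restricted - ssr_unrestricted))
```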


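Now consider the r-vector $n^{-1/2}\acute Z^\top M_{\acute R}\,\acute r$. It is equal to

$$n^{-1/2}\acute Z^\top \acute r - n^{-1}\acute Z^\top \acute R\,(n^{-1}\acute R^\top \acute R)^{-1}\,n^{-1/2}\acute R^\top \acute r. \qquad (15.73)$$

A Taylor expansion of the first term in expression (15.73) around the true
value $\theta_0$ gives

$$n^{-1/2}\acute Z^\top \acute r = n^{-1/2} Z_0^\top r_0 - n^{-1} Z_0^\top R_0\,n^{1/2}(\acute\theta - \theta_0) + O_p(n^{-1/2}), \qquad (15.74)$$

where $r_0 = r(\theta_0)$, $R_0 = R(\theta_0)$, and $Z_0 = Z(\theta_0)$. Here we have used condition R3
in the evaluation of the Jacobian of $Z^\top(\theta)\,r(\theta)$. By the requirement
that an artificial regression admits one-step estimation, the second term in
expression (15.73) is equal to

$$-n^{-1}\acute Z^\top \acute R\,n^{1/2}(\hat\theta - \acute\theta) + O_p(n^{-1/2}).$$

Because $\acute\theta$ is root-n consistent, we can replace $n^{-1}\acute Z^\top \acute R$ in this expression by
$n^{-1} Z_0^\top R_0 + O_p(n^{-1/2})$, to obtain

$$-n^{-1} Z_0^\top R_0\,n^{1/2}(\hat\theta - \theta_0) + n^{-1} Z_0^\top R_0\,n^{1/2}(\acute\theta - \theta_0) + O_p(n^{-1/2}). \qquad (15.75)$$

When we add expressions (15.74) and (15.75), the terms that involve $\acute\theta$ cancel,
and so the complete expression (15.73) becomes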


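$$\begin{aligned}
&n^{-1/2} Z_0^\top r_0 - n^{-1} Z_0^\top R_0\,n^{1/2}(\hat\theta - \theta_0) + O_p(n^{-1/2}) \\
&\quad= n^{-1/2} Z_0^\top r_0 - n^{-1} Z_0^\top R_0\,n^{1/2}(R_0^\top R_0)^{-1} R_0^\top r_0 + O_p(n^{-1/2}) \\
&\quad= n^{-1/2} Z_0^\top M_{R_0} r_0 + O_p(n^{-1/2}).
\end{aligned}$$

In the second line here, we have used the one-step property for $\theta_0$ rather
than $\acute\theta$, since $\theta_0$ is trivially root-n consistent for itself. Thus we conclude that

$$n^{-1/2}\acute Z^\top M_{\acute R}\,\acute r = n^{-1/2} Z_0^\top M_{R_0} r_0 + O_p(n^{-1/2}) \qquad (15.76)$$

for all $\acute\theta$ such that $\acute\theta - \theta_0 = O_p(n^{-1/2})$.
The asymptotic covariance matrix of the vector $n^{-1/2} Z_0^\top M_{R_0} r_0$ is

$$\mathrm{Var}\Bigl(\operatorname*{plim}_{n\to\infty} n^{-1/2} Z_0^\top M_{R_0} r_0\Bigr) = \operatorname*{plim}_{n\to\infty} \frac{1}{n} Z_0^\top M_{R_0} r_0 r_0^\top M_{R_0} Z_0.$$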


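If we replace $M_{R_0}$ by $I - R_0(R_0^\top R_0)^{-1}R_0^\top$, the right-hand side of this equation
becomes

$$\begin{aligned}
&\operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top r_0 r_0^\top Z_0 - \operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top R_0(R_0^\top R_0)^{-1}R_0^\top r_0 r_0^\top Z_0 \\
&\quad - \operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top r_0 r_0^\top R_0(R_0^\top R_0)^{-1}R_0^\top Z_0 \\
&\quad + \operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top R_0(R_0^\top R_0)^{-1}R_0^\top r_0 r_0^\top R_0(R_0^\top R_0)^{-1}R_0^\top Z_0.
\end{aligned}$$

By making use of condition R2, we can replace the limits of expressions like
$n^{-1}R_0^\top r_0 r_0^\top R_0$ by those of expressions like $n^{-1}R_0^\top R_0$. When we do this, the
rather lengthy expression above collapses to

$$\operatorname*{plim}_{n\to\infty}\Bigl(\frac{1}{n} Z_0^\top Z_0 - \frac{1}{n} Z_0^\top R_0(R_0^\top R_0)^{-1}R_0^\top Z_0\Bigr) = \operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top M_{R_0} Z_0. \qquad (15.77)$$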


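The root-n consistency of $\acute\theta$ and the result (15.76) imply that the asymptotic
covariance matrix of $n^{-1/2}\acute Z^\top M_{\acute R}\,\acute r$ is equal to the right-hand side of equation
(15.77) for all $\acute\theta$ such that $\acute\theta - \theta_0 = O_p(n^{-1/2})$. Moreover, the consistency
of $\acute\theta$ also implies that

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n} \acute Z^\top M_{\acute R}\acute Z = \operatorname*{plim}_{n\to\infty}\frac{1}{n} Z_0^\top M_{R_0} Z_0.$$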


It follows from this result and the result (15.76) that the test statistic (15.72) 
is asymptotically equal to 


$$(n^{-1/2} r_0^\top M_{R_0} Z_0)(n^{-1} Z_0^\top M_{R_0} Z_0)^{-1}(n^{-1/2} Z_0^\top M_{R_0} r_0). \qquad (15.78)$$


This is a quadratic form in the r-vector $n^{-1/2} Z_0^\top M_{R_0} r_0$ and the inverse of
its asymptotic covariance matrix. By Theorem 4.1, the statistic (15.78) must
be asymptotically distributed as $\chi^2(r)$.


Calculations very similar to those above can be used to show that, when the 
asymptotic covariance matrix of $\hat\theta$ is given by equation (15.03), F statistics
computed from the unrestricted artificial regression (15.05) and the restricted
one (15.01) evaluated at $\acute\theta$ have their namesake distributions asymptotically
under the hypothesis that the true DGP is contained in the model M. We 
leave the proof of this as Exercise 15.4. 


15.8 Exercises 


15.1 If the linear regression model $y = X\beta + u$, with error terms $u_t \sim \mathrm{IID}(0, \sigma^2)$,
is estimated using the $n \times l$ matrix W of instrumental variables, an artificial
regression that corresponds to this model and this estimator is the IVGNR

$$y - X\beta = P_W X b + \text{residuals}.$$

Suppose that we wish to test whether the n-vector z is predetermined with
respect to the error terms in u, that is, whether $\operatorname{plim} n^{-1} z^\top u = 0$. Show that
the obvious testing regression, namely,

$$y - X\beta = P_W X b + cz + \text{residuals},$$

does not satisfy the three conditions given in Section 15.2 for a valid testing
regression. What other artificial regression could be used to obtain a valid
test statistic?
15.2


15.3 


15.8 


Show that the t statistic for y = 0 in regression (15.09) is numerically identical 
to the t statistic for c = 0 in regression (15.10). 


Suppose that the dependent variable y is generated by the DGP $y = X\beta_0 + u$,
$u \sim N(0, \sigma_0^2 I)$, where the $n \times k$ matrix X is independent of u. Let z be a
vector that is not necessarily independent of u, but is independent of $M_X u$.
Show that the t statistic on z in the linear regression $y = X\beta + cz + u$ follows
the Student's t distribution with n − k − 1 degrees of freedom.


Let (15.01) be an artificial regression corresponding to a model M and an
asymptotically normal root-n consistent estimator $\hat\theta$ of the parameters of M,
with the asymptotic covariance matrix of $\hat\theta$ given by (15.03). Show that,
whenever $\acute\theta$ is a root-n consistent estimator, r times the F statistic for the
artificial hypothesis that c = 0 in the artificial regression (15.05) is asymptotically
distributed as $\chi^2(r)$ under any DGP in M.


Suppose the vector y is generated by the nonlinear regression model (6.01), 
and $Z(\beta)$ is an $n \times r$ matrix such that

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n} Z^\top(\beta)\,u = 0.$$

Show that n times the uncentered $R^2$ from regression (15.22) is asymptotically
distributed as $\chi^2(r)$.


Consider a fully parametrized model for which the $t^{\text{th}}$ observation is characterized
by a conditional density function $f_t(y^t, \theta)$, where the vector $y^t$ contains
the observations $y_1, \ldots, y_t$ on the dependent variable. The density is that
of $y_t$ conditional on $y^{t-1}$. Let the moment function $m_t(y_t, \theta)$ have expectation
zero conditional on $y^{t-1}$ when evaluated at the true parameter vector $\theta_0$.
Show that

$$E\bigl(m_t(y_t, \theta_0)\,G_t(\theta_0)\bigr) = -E\Bigl(\frac{\partial m_t}{\partial\theta}(\theta_0)\Bigr),$$

where $G_t(\theta)$ is the row vector of derivatives of $\log f_t(y^t, \theta)$, the contribution
to the loglikelihood function made by the $t^{\text{th}}$ observation, and $\partial m_t/\partial\theta\,(\theta)$
denotes the row vector of derivatives of $m_t(y_t, \theta)$ with respect to $\theta$. All
expectations are taken under the density $f_t(y^t, \theta)$. Then explain why this
result implies equation (15.26) under conditions R1 and R2 of Section 15.2.
Hint: Use the same approach as in Exercise 10.6.


Consider the following artificial regression, which was originally proposed by 
Tauchen (1985): 

$$m = Gb' + c'\iota + \text{residuals}.$$
Show that the t statistic for $c' = 0$ from this regression is numerically identical
to the t statistic for c = 0 from the OPG regression (15.24). Hint: See 
Exercise 4.8. 
Show that the regressor in the testing regression (15.32) is asymptotically 
orthogonal to the regressors in the OPG regression (15.27), when all regressors 
are evaluated at root-n consistent estimators $\acute\beta$ and $\acute\sigma$. Note that two vectors
a and b are said to be asymptotically orthogonal if $\operatorname{plim} n^{-1} a^\top b = 0$.


Prove that the t statistic from regression (15.32) is asymptotically equivalent
to the statistic $\tau_4$ defined by (15.33).
15.10


15.11 


15.12 


15.13
Show also that the statistics $\tau_3$ and $\tau_4$ are asymptotically independent under
the null of normality.


Suppose that you have a sample of n IID observations $u_t$, t = 1,...,n, on
a variable supposed to follow the standard normal distribution. How would
you test the null hypothesis of normality against alternatives allowing the $u_t$
to be skewed, to have excess kurtosis, or both?


Suppose now that the variance of the $u_t$ is unknown and must be estimated
from the sample. How would you now test the null of normality against the
same alternatives?


This question uses data on monthly returns for the period 1969-1998 for 
shares of General Electric Corporation from the file monthly-crsp.data. These 
data are made available by courtesy of the Center for Research in Security 
Prices (CRSP); see the comments at the bottom of the file. Let $R_t$ denote
the return on GE shares in month t. For the entire sample period, regress
$R_t$ on a constant and $d_t$, where $d_t$ is a dummy variable that is equal to 1
in November, December, January, and February, and equal to 0 in all other
months. Then test the residuals for evidence of skewness and excess kurtosis,
both individually and jointly. Use asymptotic tests based on normalized
residuals and tests based on the OPG regression. 


Consider the classical linear regression model

$$y_t = \beta_1 + \beta_2 x_{t2} + \beta_3 x_{t3} + u_t, \qquad u_t \sim \mathrm{NID}(0, \sigma^2),$$

where $x_{t2}$ and $x_{t3}$ are exogenous variables, and there are n observations.
Write down the contribution to the loglikelihood made by the $t^{\text{th}}$ observation.
Then calculate the matrix $M(\hat\theta)$ of which the typical element is expression
(15.36) evaluated at the ML estimates. How many columns does this matrix
have? What is a typical element of each of the columns?


Explain how to compute an information matrix test for this model using the 
OPG regression (15.25). How many regressors does the test regression have? 
What test statistic would you use, and how many degrees of freedom does it 
have? What types of misspecification is this test sensitive to? 


Show that the J statistic computed using regression (15.40) is given by

$$J = \frac{(n - k_1 - 1)^{1/2}\, y^\top M_X P_Z y}{\bigl(y^\top M_X y\;\, y^\top P_Z M_X P_Z y - (y^\top M_X P_Z y)^2\bigr)^{1/2}}.$$

Use this expression to show that the probability limit under hypothesis $H_1$
of $n^{-1}$ times the denominator is


sox 1 
oô plim = Bo X'Pz Mx Pz Xo, 


n— Co 


where $\beta_0$ and $\sigma_0^2$ are the true parameters.


Consider the nonnested linear regression models given in equations (15.37).
Suppose that just one column of Z does not lie in $\mathcal{S}(X)$. In this special case,
how is the J statistic for testing $H_1$ from regression (15.40) related to the
F statistic for $\gamma' = 0$ in the inclusive regression (15.39)?
15.14


15.15 


15.16 


15.17 


15.18 


15.19 


How is the P statistic from equation (15.47) related to the J statistic from
equation (15.46) when the regression function $x(\beta)$ for the $H_1$ model can be
written as $X\beta$?


The P test regression (15.47) can be interpreted as a Gauss-Newton regression 
for testing a moment condition. Make this moment condition explicit and 
explain why it makes sense. 

This question uses data from the file consumption.data. As in previous exercises
that use these data, $c_t$ is the log of consumption, and $y_t$ is the log of
disposable income. All models are to be estimated, and all tests calculated,
for the 176 observations from 1953:1 to 1996:4. 


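Using ordinary least squares, estimate the ADL model in levels,
$$c_t = \alpha + \beta c_{t-1} + \gamma_0 y_t + \gamma_1 y_{t-1} + u_t, \qquad (15.79)$$
and the ADL model in first differences,
$$\Delta c_t = \alpha' + \beta'\Delta c_{t-1} + \gamma_0'\Delta y_t + \gamma_1'\Delta y_{t-1} + u_t, \qquad (15.80)$$


where $\Delta c_t = c_t - c_{t-1}$ and $\Delta y_t = y_t - y_{t-1}$.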


Test each of these two models against the other using a J test. Then test both 
models against a more general model that includes both of them as special 
cases. Report asymptotically valid P values for all four tests. 


Calculate semiparametric bootstrap P values for the four tests of the previous 
exercise, using the procedure discussed in Section 4.6. Do the bootstrap tests 
yield the same inferences as the asymptotic ones? What can you tentatively 
conclude from these results? 

Consider the two nonnested linear regression models (15.37). An encompassing
test can be based on the estimate of the error variance of model $H_2$ rather
than on the estimates of the parameters $\gamma$. Let $\hat\sigma_2^2$ be the usual ML estimate
obtained from estimating $H_2$. Compute the expectation of $\hat\sigma_2^2$ under the DGP
in model $H_1$ with parameters $\beta$ and $\sigma_1^2$. Let $\tilde\sigma_2^2$ be a consistent estimate of this
expectation based on the estimates of $\beta$ and $\sigma_1^2$ obtained by estimating $H_1$.


Show that $n^{1/2}(\hat\sigma_2^2 - \tilde\sigma_2^2)$ is asymptotically equal to a random variable that
is proportional to $u^\top M_X P_Z X\beta_0$, in the notation of equation (15.41). What
does this result imply about the relationship between the variance encompassing
test and the J test?


Consider the binary response models

$$H_1:\; E(y_t \mid \Omega_t) = F_1(X_t\beta), \quad\text{and}$$
$$H_2:\; E(y_t \mid \Omega_t) = F_2(Z_t\gamma),$$

where $F_1(\cdot)$ and $F_2(\cdot)$ may be any transformation functions that satisfy
conditions (11.02). Starting from the artificial comprehensive model

$$E(y_t \mid \Omega_t) = (1 - \alpha)F_1(X_t\beta) + \alpha F_2(Z_t\gamma), \qquad (15.81)$$

show how to compute a nonnested hypothesis test similar to the P test but
based on the BRMR (Section 11.3) instead of the GNR.
15.21
Let the loglikelihood function of a model to be estimated by maximum likelihood
be given by a sum of contributions from each observation:


$$\ell(\theta) = \sum_{t=1}^{n} \ell_t(\theta).$$


Show that, if $\hat\theta$ is a root-n consistent, asymptotically normal, Type 2 MLE of a true
parameter vector $\theta_0$, then


Boo (17™ (E) — €(60))) = p(n”), 


Use this result to show that n—*/2(6(6) — Eg(€(8))) is also of order n~!/? as 
n— œ. 


Show that the statistic (15.53) for a Cox test of the nonnested linear regression 
models (15.37) is equal to 


x2 A 
T =n? log (“1 5"#), (15.82) 
oO 


where $\hat\sigma_i^2$, i = 1, 2, is the ML estimate of the error variance from estimating
model $H_i$, and $\hat\sigma_a^2 = n^{-1}\|M_Z P_X y\|^2$. Show that the statistic T is asymptotically
proportional to the J statistic and also, therefore, to the variance
encompassing test statistic of Exercise 15.18. Why is it not surprising that
the Cox test, which can be interpreted as an encompassing test based on the
maximized loglikelihood, should be asymptotically equivalent to the variance
encompassing test?


Show that the asymptotic variance of the statistic (15.82) is 
dot . 1 2 
E plim 7| Mx MzPxyll", 


OF Oa n—-oo 


where $\sigma_a^2 = \operatorname{plim} \hat\sigma_a^2 = \operatorname{plim} n^{-1}\|M_Z P_X y\|^2$. Use this result to write down a
Cox statistic in asymptotically N(0,1) form.


Set up the OPG artificial regression for the Cox test of model $H_1$ against $H_2$
in (15.37). In particular, show that, in the notation of Exercise 15.21, the
typical element of the test regressor can take either of the forms 


a2 42 -2 
log (215%) 1, or 
2 2 
g g A 7 p (15.83) 
1 ôi + Gq a, êi +(MzPxy): 
08 52 52 22g? : 
2 2 1 a 


The file nonnested.data contains 40 observations on artificially generated variables
$x_1$, $x_2$, $z_2$, $z_3$, and $z_4$. Consider the nonnested linear regression models


$$H_1:\; y = \alpha_1\iota + \beta_1 x_1 + \beta_2 x_2 + u, \quad\text{and}$$
$$H_2:\; y = \alpha_2\iota + \gamma_1 x_1 + \gamma_2 z_2 + \gamma_3 z_3 + \gamma_4 z_4 + u.$$
15.24


15.25 


15.26 


15.27 


15.28 


Perform a simulation experiment where, for each replication, the dependent
variable y is generated by the DGP in $H_1$ with $\alpha_1 = 0$, $\beta_1 = \beta_2 = 1$, and
normally distributed error terms with variance $\sigma^2 = 1$. For each simulated
data set, compute the J statistic, the Cox statistic of Exercise 15.21, and two
statistics based on an OPG artificial regression, with test regressors of the two 
forms (15.83). Compare the empirical distribution functions of the simulated 
statistics with the nominal N(0,1) distribution. 


Repeat the experiment with the DGP $y = x_1 + x_2 + 0.5 z_3 + u$, $u \sim \mathrm{NID}(0, 1)$.
Given that all four statistics have finite-sample distributions different from the 
nominal asymptotic N(0,1) distribution, how can one use the results of both 
experiments to measure the ability of the statistics to discriminate between 
the DGPs of the two experiments? 


Consider two nonnested models $H_1$ and $H_2$ characterized by the loglikelihood
functions (15.50). If the true DGP $\mu$ belongs to $H_1$ and not to $H_2$, show that

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n}\bigl(\ell_1(\hat\theta_1) - \ell_2(\hat\theta_2)\bigr) > 0, \qquad (15.84)$$

where $\hat\theta_1$ and $\hat\theta_2$ are the MLEs of the two models. Hint: Use Jensen's inequality
for the contribution to the two likelihood functions from each observation.


Let $\mathrm{BIC}_i$ denote the Bayesian information criterion (15.57) for model $H_i$. Use
(15.84) to show that $\mathrm{BIC}_1 - \mathrm{BIC}_2$ tends to $+\infty$ as $n \to \infty$.


Show that the kernel density defined in equation (15.61) is nonnegative and 
integrates to unity. 


For a given choice of bandwidth, the expectation of the estimate $\hat f_h(x)$
of (15.61) is $h^{-1}$ times the expectation of the random variable $k((x - X)/h)$,
where X denotes the random variable of which the $x_t$ are IID realizations.
Assume that k is symmetric about the origin. Show that the bias of $\hat f_h(x)$ is
independent of the sample size n and roughly proportional to $h^2$ for small h.
More formally, this means that the bias is $O(h^2)$ as $h \to 0$.


Show also that the variance of $\hat f_h(x)$ is of order $(nh)^{-1}$ as $n \to \infty$ and
$h \to 0$. Why do these facts imply that the bandwidth h that minimizes the
expectation of the squared error of $\hat f_h(x)$ must be of order $n^{-1/5}$ as $n \to \infty$?


15.27 A subsample of the data used for Figures 15.1 and 15.2 can be found in the file
daily-crsp.data, for the period from January 1989 to December 1998. Estimate
the density of the percentage daily returns on IBM stock (the data in the file
times 100) using the same bandwidths as those used for the figures, with
both Gaussian and Epanechnikov kernels. Also perform the estimation using
the uniform kernel $k(z) = 1/(2\sqrt{3})$ for $|z| < \sqrt{3}$ and $k(z) = 0$ outside this
range. Finally, compute a histogram with bin width equal to the bandwidths
of the kernel estimators. Graph your results. What conclusions can you draw
from them?
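
A minimal Python sketch of the three kernel estimators and the histogram is given below. Several things in it are assumptions rather than part of the exercise: the returns are synthetic stand-ins for the IBM data in daily-crsp.data, the bandwidth 0.5 is purely illustrative and is not one of the values used for the figures, the Epanechnikov kernel is taken in a unit-variance form (support $|z| < \sqrt{5}$), and the function names are hypothetical.

```python
import math
import random


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def epanechnikov(z):
    # Assumed unit-variance form, so the support is |z| < sqrt(5).
    root5 = math.sqrt(5.0)
    return 3.0 * (1.0 - z * z / 5.0) / (4.0 * root5) if abs(z) < root5 else 0.0


def uniform(z):
    # The uniform kernel stated in the exercise.
    return 1.0 / (2.0 * math.sqrt(3.0)) if abs(z) < math.sqrt(3.0) else 0.0


def kde(x, data, h, kernel):
    """Kernel density estimate at x: (1/(n h)) * sum_t k((x - x_t)/h)."""
    return sum(kernel((x - xt) / h) for xt in data) / (len(data) * h)


def histogram_density(x, data, width, origin=0.0):
    """Histogram estimate at x: share of data in x's bin, divided by the bin width."""
    bin_index = math.floor((x - origin) / width)
    count = sum(1 for xt in data if math.floor((xt - origin) / width) == bin_index)
    return count / (len(data) * width)


# Synthetic stand-in for the percentage returns (assumption: the real exercise
# reads daily-crsp.data and multiplies the IBM returns by 100).
rng = random.Random(2024)
returns = [rng.gauss(0.0, 1.5) for _ in range(1000)]

h = 0.5  # illustrative bandwidth only
kernels = [("gaussian", gaussian), ("epanechnikov", epanechnikov), ("uniform", uniform)]
for name, kernel in kernels:
    print(name, round(kde(0.0, returns, h, kernel), 4))
print("histogram", round(histogram_density(0.0, returns, h), 4))
```

Evaluating `kde` and `histogram_density` on a fine grid of points, rather than at the single point 0, yields the curves to be graphed.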


15.28 Regress the IBM daily returns in daily-crsp.data on the returns for the CRSP
value-weighted index in the same file by nonparametric regression. Since these 
data have fat tails, you will find it necessary to trim the tails. A reasonable 
way to do this is to eliminate from the sample all observations for which the 
return on the index is greater than 2.6 in absolute value. Compute both the 
Nadaraya-Watson estimator and the locally linear estimator. For the former, 
use the bandwidth (15.64) computed using the IQR of the trimmed index 
returns; for the latter a much greater value, around 1, is more appropriate. 


Compute the cross-validation function (15.70), with weights equal to 1, for 
both the Nadaraya-Watson and locally linear estimators. For the former, it 
should be enough to compute the function in the neighborhood of the value 
given by (15.64), but for the latter you should explore larger values. Finally, 
compare your results with an OLS regression of the IBM returns on a constant 
and the index returns. 
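
The two estimators and a leave-one-out criterion can be sketched in Python as below. This is a hedged illustration, not the text's own procedure: the data are synthetic stand-ins for the IBM and index returns, a Gaussian kernel is assumed, the bandwidth grids are arbitrary, and `cross_validation` is a plain sum of squared leave-one-out errors with unit weights whose scaling may differ from (15.70). All function names are hypothetical.

```python
import math
import random


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, h, kernel=gaussian):
    """Locally constant fit: kernel-weighted average of y near x0."""
    weights = [kernel((x - x0) / h) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def locally_linear(x0, xs, ys, h, kernel=gaussian):
    """Intercept of a kernel-weighted least-squares line in (x - x0)."""
    weights = [kernel((x - x0) / h) for x in xs]
    s0 = sum(weights)
    s1 = sum(w * (x - x0) for w, x in zip(weights, xs))
    s2 = sum(w * (x - x0) ** 2 for w, x in zip(weights, xs))
    t0 = sum(w * y for w, y in zip(weights, ys))
    t1 = sum(w * (x - x0) * y for w, x, y in zip(weights, xs, ys))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


def cross_validation(xs, ys, h, estimator):
    """Sum of squared leave-one-out prediction errors, with unit weights."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        total += (ys[i] - estimator(xs[i], rest_x, rest_y, h)) ** 2
    return total


# Synthetic stand-ins for the index and IBM returns (assumption).
rng = random.Random(7)
index = [rng.gauss(0.0, 1.0) for _ in range(300)]
ibm = [0.1 + 1.2 * x + rng.gauss(0.0, 1.5) for x in index]

# Trim observations whose index return exceeds 2.6 in absolute value.
pairs = [(x, y) for x, y in zip(index, ibm) if abs(x) <= 2.6]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

for h in (0.2, 0.4, 0.8):
    print("NW", h, round(cross_validation(xs, ys, h, nadaraya_watson), 2))
for h in (0.5, 1.0, 2.0):
    print("LL", h, round(cross_validation(xs, ys, h, locally_linear), 2))
```

The leave-one-out loop costs on the order of $n^2$ kernel evaluations per bandwidth, so a coarse grid is a sensible starting point before refining around the minimum.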


15.29 Suppose that the $k$-vector $\boldsymbol{x}_t \sim N(\boldsymbol{0}, \mathbf{I})$. What is the probability that the
Euclidean distance between $\boldsymbol{x}_t$ and the origin is less than 1 if $k = 1$? What
is it if $k = 2$, $k = 3$, and $k = 4$?
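
One way to check an analytical answer is by simulation. The Python sketch below is an illustration only; the function name, the number of replications, and the seed are arbitrary assumptions. It uses the fact that the squared distance is a sum of $k$ squared independent standard normal variables.

```python
import random


def prob_within_unit_ball(k, replications=100_000, seed=42):
    """Monte Carlo estimate of P(||x|| < 1) for x ~ N(0, I_k)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        # Squared Euclidean distance from the origin; distributed as chi-squared(k).
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        if squared_norm < 1.0:
            hits += 1
    return hits / replications


for k in (1, 2, 3, 4):
    print(k, round(prob_within_unit_ball(k), 3))
```

The simulated proportions can be compared with the $\chi^2(k)$ distribution function evaluated at 1.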

15.30 Let $\tau_1$ and $\tau_2$ each be distributed as $N(0, 1)$, with correlation $\rho$. For $\rho = -0.9$,
$-0.5$, 0, 0.5, and 0.9, generate at least 10,000 realizations of $\tau_1$ and $\tau_2$ and
calculate the proportion of the time that either statistic is greater than the
0.95 quantile of the standard normal distribution. What does this experiment
tell us about the overall significance level when we perform two tests that are
not independent?
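
A minimal Python sketch of this experiment follows. The function name, seed, and replication count are assumptions made for illustration. The second statistic is built as $\rho\tau_1 + \sqrt{1-\rho^2}\,\varepsilon$ with $\varepsilon$ an independent standard normal draw, which gives it unit variance and correlation $\rho$ with $\tau_1$.

```python
import math
import random
from statistics import NormalDist


def either_rejects(rho, replications=20_000, level=0.05, seed=1):
    """Share of replications in which tau_1 or tau_2 exceeds the 1 - level quantile."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - level)
    scale = math.sqrt(1.0 - rho * rho)
    rejections = 0
    for _ in range(replications):
        tau1 = rng.gauss(0.0, 1.0)
        # tau2 has unit variance and correlation rho with tau1 by construction.
        tau2 = rho * tau1 + scale * rng.gauss(0.0, 1.0)
        if tau1 > critical or tau2 > critical:
            rejections += 1
    return rejections / replications


for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(rho, round(either_rejects(rho), 4))
```

For $\rho = 0$ the proportion should be close to $1 - 0.95^2 = 0.0975$, which serves as a quick check that the simulation is set up correctly.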